Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
Through discussion in https://issues.redhat.com/browse/OCPBUGS-13966, we have decided that port 80 cannot be supported in conjunction with port 443 at this time for the default route ingress.
This needs to be documented for 4.14
As an OpenShift admin, I want to leverage /dev/fuse in unprivileged containers so that I can integrate cloud storage into OpenShift applications in a secure, efficient, and scalable manner. This approach simplifies application architecture and allows developers to interact with cloud storage as if it were a local filesystem, all while maintaining strong security practices.
As a customer, I would like to deploy OpenShift on OpenStack using the IPI workflow, where my control plane would have 3 machines and each machine would use a root volume (a Cinder volume attached to the Nova server) and also an attached ephemeral disk using local storage that would only be used by etcd.
As this feature will be TechPreview in 4.15, this will only be implemented as a day 2 operation for now. This might or might not change in the future.
We know that etcd requires storage with strong performance capabilities, and currently a root volume backed by Ceph has difficulty providing these capabilities.
Attaching local storage to the machine and mounting it for etcd would solve the performance issues that we saw when customers were using Ceph as the backend for the control plane disks.
Gophercloud already supports creating a server with multiple ephemeral disks.
We need to figure out how we want to address that in CAPO, probably involving a new API that would later be used in OpenShift (MAPO, and probably the installer).
We'll also have to update the OpenStack Failure Domain in CPMS.
ARO (Azure) has conducted some benchmarks and now recommends putting etcd on a separate data disk:
https://docs.google.com/document/d/1O_k6_CUyiGAB_30LuJFI6Hl93oEoKQ07q1Y7N2cBJHE/edit
Also interesting thread: https://groups.google.com/u/0/a/redhat.com/g/aos-devel/c/CztJzGWdsSM/m/jsPKZHSRAwAJ
Once we have defined an API for data volumes, we'll need to add support for this new API in MAPO so the user can update their Machines on day 2 to be redeployed with etcd on local disk.
We only allow usage of controlPlanePort as a TechPreview feature. We should move it to GA.
Open questions:
The installer should no longer accept Kuryr as NetworkType. If the user chooses it, the installer should show a clear error stating that Kuryr is no longer supported.
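The expected behavior can be sketched as a small validation step. This is an illustrative Python sketch only, not the installer's actual code; the function name `validate_network_type`, the lookup table, and the error wording are all assumptions made for the example.

```python
# Hypothetical sketch: reject network types that have been removed.
# The table and the message text are illustrative assumptions.
REMOVED_NETWORK_TYPES = {
    "Kuryr": "Kuryr is no longer supported as of OpenShift 4.15",
}


def validate_network_type(network_type: str) -> None:
    """Raise ValueError with a clear message if the network type was removed."""
    reason = REMOVED_NETWORK_TYPES.get(network_type)
    if reason is not None:
        raise ValueError(f"networkType {network_type!r} is invalid: {reason}")


# Usage: other values pass this particular check; Kuryr is rejected.
validate_network_type("OVNKubernetes")
try:
    validate_network_type("Kuryr")
except ValueError as err:
    print(err)
```

Keeping the removed types in a table with a reason string makes the error message explicit about why the value is rejected rather than reporting a generic "unknown value".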
Since Kuryr has been removed, we no longer need to generate the trunk names. That code can be removed.
Kuryr is no longer supported in 4.15 and there cannot be a 4.15 cluster with Kuryr, either a new one or upgraded. Therefore we want to remove Kuryr from must-gather.
Add it to the OOTB runtime classes to allow access without a custom MC
As a cluster-admin, I wish to see the status of an update and the progress of the update on each component.
Background
A common improvement to the update experience requested in customer interactions is a status command.
oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0    True        True          16s     Working towards 4.12.4: 9 of 829 done (1% complete)
Update docs for UX and CLI changes
Epic Goal*
Add a new `oc adm upgrade status` command that is backed by an API. The mock output of the command is attached to this card.
Add control plane update status to "oc adm upgrade status" command output.
Note: In future we want to add "Est Time Remaining: 35m" to this output but need a separate card.
Sample output :
= Control Plane =
Assessment:      Progressing - Healthy
Completion:      45%
Duration:        23m
Operator Status: 33 Total, 33 Available, 4 Progressing, 0 Degraded
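The "Operator Status" line in the sample output could be derived by counting operator conditions. The following is a minimal Python sketch, assuming each cluster operator is reduced to three booleans; the `OperatorState` type and the `summarize_operators` function are hypothetical and are not part of `oc`.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OperatorState:
    """Hypothetical, simplified view of one cluster operator's conditions."""
    name: str
    available: bool
    progressing: bool
    degraded: bool


def summarize_operators(operators: List[OperatorState]) -> str:
    """Build a summary line in the style of the sample output."""
    total = len(operators)
    available = sum(1 for op in operators if op.available)
    progressing = sum(1 for op in operators if op.progressing)
    degraded = sum(1 for op in operators if op.degraded)
    return (
        f"{total} Total, {available} Available, "
        f"{progressing} Progressing, {degraded} Degraded"
    )


# Usage: 33 operators, all available, the first 4 still progressing.
operators = [
    OperatorState(f"operator-{i}", available=True, progressing=i < 4, degraded=False)
    for i in range(33)
]
print("Operator Status:", summarize_operators(operators))
```

Note that an operator can be both Available and Progressing at the same time, which is why the counts in the sample output do not add up to the total.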
Todo:
This feature is about reducing the complexity of the CAPI install system architecture, which is needed for using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift.
prerequisite work Goals
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
There must be no negative effect on customers/users of the MAPI. This API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
As an OpenShift engineer I want the CAPI Providers repositories to use the new generator tool so that they can independently generate CAPI Provider transport ConfigMaps
Once the new CAPI manifests generator tool is ready, we want to make use of that directly from the CAPI Providers repositories so we can avoid storing the generated configuration centrally and independently apply that based on the running platform.
To reduce the complexity of the system, we propose removing the upstream cluster-api operator (kubernetes-sigs/cluster-api-operator). This component is currently responsible for reading, fetching, and installing the desired providers in the cluster. We plan to take over these responsibilities by implementing them directly in the downstream openshift/cluster-capi-operator.
As an OpenShift engineer I want to be able to install the new manifest generation tool as a standalone tool in my CAPI Infra Provider repo to generate the CAPI Provider transport ConfigMap(s)
Rename the CAPI Asset/Manifest generator from assets (generator) to manifest-gen, as it no longer needs to generate Go embeddable assets, only manifests that will be referenced and applied by CVO.
Add support for Johannesburg, South Africa (africa-south1) in GCP
As a user I'm able to deploy OpenShift in Johannesburg, South Africa (africa-south1) in GCP and this region is fully supported
A user can deploy OpenShift in GCP Johannesburg, South Africa (africa-south1) using all the supported installation tools for self-managed customers.
Support for this region is backported to the previous OpenShift EUS release.
Google Cloud has added a new region to their public cloud offering, and this region needs to be supported for OpenShift deployments like other regions.
The information of the new region needs to be added to the documentation so this is supported.
Epic Goal*
Provide a long term solution to SELinux context labeling in OCP.
Why is this important? (mandatory)
As of today, when SELinux is enabled, the PV's files are relabeled when the PV is attached to the pod. This can cause timeouts when the PV contains a lot of files, and it can overload the storage backend.
https://access.redhat.com/solutions/6221251 provides a few workarounds until the proper fix is implemented. Unfortunately, these workarounds are not perfect, and we need a long-term, seamless, optimised solution.
This feature tracks the long-term solution, where the PV filesystem is mounted with the right SELinux context, avoiding the need to relabel every file.
Scenarios (mandatory)
As we are relying on mount context there should not be any relabeling (chcon) because all files / folders will inherit the context from the mount context
More on design & scenarios in the KEP and related epic STOR-1173
Dependencies (internal and external) (mandatory)
None for the core feature
However the driver will have to set SELinuxMountSupported to true in the CSIDriverSpec to enable this feature.
Support the upstream feature "SELinux relabeling using mount options (CSIDriver API change)" in OCP as Beta, i.e. test it and have docs for it (unless it's Alpha upstream).
Summary: If a Pod has a defined SELinux context (e.g. it uses the "restricted" SCC), it uses a ReadWriteOncePod PVC, and the CSI driver responsible for the volume supports this feature, kubelet and the CSI driver will mount the volume directly with the correct SELinux labels. Therefore CRI-O does not need to recursively relabel the volume, and pod startup can be significantly faster. We will need thorough documentation for this.
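The three conditions in the summary can be written as a single predicate. The Python sketch below only illustrates the decision described above and is not kubelet's actual logic; the function names and the example label are assumptions. The `context=` mount option it builds is the standard SELinux mount option.

```python
from typing import List, Optional


def can_mount_with_selinux_context(
    pod_selinux_label: Optional[str],
    pvc_access_modes: List[str],
    driver_supports_selinux_mount: bool,
) -> bool:
    """Return True when all three conditions from the summary hold."""
    return (
        pod_selinux_label is not None
        and pvc_access_modes == ["ReadWriteOncePod"]
        and driver_supports_selinux_mount
    )


def selinux_mount_option(label: str) -> str:
    """Build the mount option that applies one label to the whole volume."""
    return f'context="{label}"'


# Usage with an illustrative label (not taken from a real cluster).
label = "system_u:object_r:container_file_t:s0:c10,c20"
if can_mount_with_selinux_context(label, ["ReadWriteOncePod"], True):
    print("mount -o", selinux_mount_option(label))
else:
    print("fall back to recursive relabeling")
```

When any condition fails, the sketch falls back to the existing recursive relabeling path, which matches the behavior the feature is intended to avoid only where it is safe to do so.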
This upstream epic actually will be implemented by us!
Test that the metrics described in the KEP provide useful data. I.e. check that volume_manager_selinux_volume_context_mismatch_warnings_total increases when there are two Pods that have two different SELinux contexts and use the same volume and different subpath of it.
Add the metrics described in the upstream KEP to telemetry, so we know how many clusters / Pods would be affected when we expose SELinux mount to all volume types.
We want:
As a cluster user, I want to use mounting with SELinux context without any configuration.
This means OCP ships CSIDriver objects with "SELinuxMount: true" for CSI drivers that support mounting with "-o context". I.e. all CSI drivers that are based on block volumes and use ext4/xfs should have this enabled.
Add support for Dammam, Saudi Arabia, Middle East (me-central2) region in GCP
As a user I'm able to deploy OpenShift in Dammam, Saudi Arabia, Middle East (me-central2) region in GCP and this region is fully supported
A user can deploy OpenShift in GCP Dammam, Saudi Arabia, Middle East (me-central2) region using all the supported installation tools for self-managed customers.
Support for this region is backported to the previous OpenShift EUS release.
Google Cloud has added a new region to their public cloud offering, and this region needs to be supported for OpenShift deployments like other regions.
The information of the new region needs to be added to the documentation so this is supported.
Build out more networking-related monitoring dashboards (accessible from the Observe > Dashboards menu in the admin perspective).
Start with just a couple of areas:
More info/discussion in this work doc: https://docs.google.com/document/d/1ByNIJiOzd6w5csFYpC27NdOydnBg8Tx45uL4-7v-aCM/edit
Martin Kennelly is our contact point from the SDN team
Create a dashboard from the CNO
Current metrics documentation:
Include metrics for:
The OpenShift Assisted Installer is a user-friendly OpenShift installation solution for the various platforms, but focused on bare metal. This very useful functionality should be made available for the IBM zSystem platform.
Use of the OpenShift Assisted Installer to install OpenShift on an IBM zSystem
Using the OpenShift Assisted Installer to install OpenShift on an IBM zSystem
As a multi-arch development engineer, I would like to ensure that the Assisted Installer workflow is fully functional and supported for z/VM deployments.
Acceptance Criteria
Description of the problem:
Other than through the API, there is no way to provide additional kernel arguments to the coreos installer.
For zVM installations, additional kernel arguments are required to enable network or storage devices correctly. If they are not provided, the node ends up in an emergency shell when it is rebooted after the coreos installation.
This is an example of an API call to install a zVM node. The call needs to be executed for each node individually, after the node has been discovered:
curl https://api.openshift.com/api/assisted-install/v2/infra-envs/${INFRA_ENV_ID}/hosts/$1/installer-args \
-X PATCH \
-H "Authorization: Bearer ${API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
  "args": [
    "--append-karg", "rd.neednet=1",
    "--append-karg", "ip=10.14.6.4::10.14.6.1:255.255.255.0:master-1.boea3e06.lnxero1.boe:encbdd0:none",
    "--append-karg", "nameserver=10.14.6.1",
    "--append-karg", "ip=[fd00::4]::[fd00::1]:64::encbdd0:none",
    "--append-karg", "nameserver=[fd00::1]",
    "--append-karg", "zfcp.allow_lun_scan=0",
    "--append-karg", "rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1",
    "--append-karg", "rd.zfcp=0.0.8003,0x500507630400d1e3,0x4000404700000000",
    "--append-karg", "rd.zfcp=0.0.8103,0x500507630400d1e3,0x4000404700000000"
  ]
}' | jq
The kernel arguments might differ between discovered nodes.
This applies to zVM only. On KVM (s390x) these additional kernel arguments are not needed.
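Because the kernel arguments differ per node, the request body for the installer-args call can be generated from a per-node list instead of being written by hand. A minimal Python sketch, assuming the payload shape shown in the curl example above; the helper name `build_installer_args` is hypothetical.

```python
import json
from typing import Dict, List


def build_installer_args(kargs: List[str]) -> Dict[str, List[str]]:
    """Pair every kernel argument with an --append-karg flag."""
    args: List[str] = []
    for karg in kargs:
        args.extend(["--append-karg", karg])
    return {"args": args}


# Usage: a subset of the kernel arguments from the example above.
payload = build_installer_args([
    "rd.neednet=1",
    "zfcp.allow_lun_scan=0",
    "rd.znet=qeth,0.0.bdd0,0.0.bdd1,0.0.bdd2,layer2=1",
])
print(json.dumps(payload, indent=2))
```

The printed JSON can be passed as the `-d` body of the PATCH request for the corresponding host.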
How reproducible:
Configure a zVM node (in case of SNO) or at least 3 zVM nodes, create a new cluster by choosing a cluster option (SNO or failover) and discover the nodes.
In the UI there is no option to enter the required kernel arguments for the coreos installer.
After validation, start the installation. The installation fails because the nodes cannot be rebooted.
Steps to reproduce:
1. Configure at least one zVM node (for SNO) or three for failover accordingly.
2. Discover these nodes
3. Start installation after all validation steps are passed.
Actual results:
The installation fails because the kernel command line does not contain the necessary kernel arguments, so the first reboot fails (nodes boot into the emergency shell).
There is no option to enter additional kernel arguments in the UI.
Expected results:
When installing on zVM through the Assisted Installer UI, the user is able to specify the necessary kernel arguments for each discovered zVM node. These kernel arguments are passed to the coreos installer, and the installation is successful.
The storage operators need to be automatically restarted after the certificates are renewed.
From OCP doc "The service CA certificate, which issues the service certificates, is valid for 26 months and is automatically rotated when there is less than 13 months validity left."
Since OCP is now offering an 18 months lifecycle per release, the storage operator pods need to be automatically restarted after the certificates are renewed.
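The timing argument can be checked with simple arithmetic. The Python sketch below uses the 26-month validity and 13-month rotation threshold from the quoted documentation, and assumes (for illustration only) that the service CA is issued at install time and that the cluster stays on one release for its whole lifecycle.

```python
CA_VALIDITY_MONTHS = 26        # from the OCP doc quoted above
ROTATION_THRESHOLD_MONTHS = 13  # rotate when less than this much validity is left


def rotation_age_months(validity: int, threshold: int) -> int:
    """Age of the CA (in months) at which remaining validity reaches the threshold."""
    return validity - threshold


def rotates_within_lifecycle(
    lifecycle_months: int,
    validity: int = CA_VALIDITY_MONTHS,
    threshold: int = ROTATION_THRESHOLD_MONTHS,
) -> bool:
    """True if automatic rotation happens before the release lifecycle ends."""
    return rotation_age_months(validity, threshold) < lifecycle_months


# Usage: compare a 12-month and an 18-month release lifecycle.
for lifecycle in (12, 18):
    print(lifecycle, "months:", rotates_within_lifecycle(lifecycle))
```

Under these assumptions the rotation point (month 13) falls outside a 12-month lifecycle but inside an 18-month one, which is why the restart now needs to happen automatically.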
The storage operators will be restarted transparently. The benefit to the customer is that the storage operators no longer need to be restarted manually.
The administrator should not need to restart the storage operators when certificates are renewed.
This should apply to all relevant operators with a consistent experience.
As an administrator I want the storage operators to be automatically restarted when certificates are renewed.
This feature request is triggered by the new extended OCP lifecycle. We are moving from 12 to 18 months of support per release.
No doc is required
This feature only covers storage, but the same behavior should be applied to every relevant component.
The pod `vsphere-problem-detector-operator` mounts the secret:
$ cat assets/vsphere_problem_detector/07_deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vsphere-problem-detector-operator
spec:
  containers:
    volumeMounts:
    - mountPath: /var/run/secrets/serving-cert
      name: vsphere-problem-detector-serving-cert
  volumes:
  - name: vsphere-problem-detector-serving-cert
    secret:
      secretName: vsphere-problem-detector-serving-cert
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
The pod `csi-snapshot-webhook` mounts the secret:
```
$ cat assets/webhook/deployment.yaml
kind: Deployment
metadata:
  name: csi-snapshot-webhook
...
spec:
  template:
    spec:
      containers:
        volumeMounts:
        - name: certs
          mountPath: /etc/snapshot-validation-webhook/certs
      volumes:
      - name: certs
        secret:
          secretName: csi-snapshot-webhook-secret
```
Hence, if the secret is updated (e.g. as a result of CA cert update), the Pod must be restarted.
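One common way to trigger such a restart is to record a hash of the secret's contents and restart the Pod when the hash changes. The Python sketch below illustrates that general pattern only; it is not a description of how the storage operators implement it, and the function names are hypothetical.

```python
import hashlib
from typing import Mapping


def secret_hash(data: Mapping[str, bytes]) -> str:
    """Return a stable SHA-256 digest over the secret's keys and values."""
    digest = hashlib.sha256()
    for key in sorted(data):  # sort so key order does not affect the hash
        digest.update(key.encode("utf-8"))
        digest.update(b"\0")
        digest.update(data[key])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_restart(recorded_hash: str, current_data: Mapping[str, bytes]) -> bool:
    """True when the secret no longer matches the hash recorded at Pod start."""
    return recorded_hash != secret_hash(current_data)


# Usage: the serving certificate is replaced after a CA rotation.
before = {"tls.crt": b"old-cert", "tls.key": b"old-key"}
after = {"tls.crt": b"new-cert", "tls.key": b"new-key"}
recorded = secret_hash(before)
print("restart needed:", needs_restart(recorded, after))
```

Storing the recorded hash somewhere that is part of the Pod template (for example an annotation) means a change in the hash also changes the template, which causes the Deployment to roll out new Pods.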
Description of problem:
Even though in 4.11 we introduced LegacyServiceAccountTokenNoAutoGeneration to be compatible with upstream K8s to not generate secrets with tokens when service accounts are created, today OpenShift still creates secrets and tokens that are used for legacy usage of openshift-controller as well as the image-pull secrets.
Customer issues:
Customers see auto-generated secrets for service accounts, which are flagged as a security risk.
This Feature is to track the implementation for removing legacy usage and image-pull secret generation as well so that NO secrets are auto-generated when a Service Account is created on OpenShift cluster.
NO Secrets to be auto-generated when creating service accounts
The following secrets must NOT be generated automatically with every service account creation:
Concerns/Risks: Replacing functionality of one of the openshift-controller used for controllers that's been in the code for a long time may impact behaviors that w
Existing documentation needs to be clear on where we are today and why we are providing the above 2 credentials. Related Tracker: https://issues.redhat.com/browse/OCPBUGS-13226
Openshift-controller-manager has a controller that automatically creates "managed" service accounts that support OpenShift features in every namespace. In OCP 4, this was effectively hard-coded to create the "builder" and "deployer" service accounts.
This controller should be refactored so we have separate instances for the builder and deployer service account, respectively. This will let us disable said controllers via the Capabilities API and ocm-o.
Create custom roles for GCP with minimal set of required permissions.
Enable customers to better scope credential permissions and create custom roles on GCP that only include the minimum subset of what is needed for OpenShift.
Some of the service accounts that CCO creates, e.g. a service account with the role roles/iam.serviceAccountUser, provide elevated permissions that are not required or used by the requesting OpenShift components. This is because we use predefined roles for GCP that come with a number of additional permissions. The goal is to create custom roles with only the required permissions.
TBD
Evaluate if any of the GCP predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct {
	metav1.TypeMeta `json:",inline"`
	// PredefinedRoles is the list of GCP pre-defined roles
	// that the CredentialsRequest requires.
	PredefinedRoles []string `json:"predefinedRoles"`
	// Permissions is the list of GCP permissions required to
	// create a more fine-grained custom role to satisfy the
	// CredentialsRequest.
	// When both Permissions and PredefinedRoles are specified
	// service account will have union of permissions from
	// both the fields
	Permissions []string `json:"permissions"`
	// SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions
	// have the necessary services enabled
	// +optional
	SkipServiceCheck bool `json:"skipServiceCheck,omitempty"`
}
Use the following command to check the permissions associated with a GCP predefined role:
gcloud iam roles describe <role_name>
The sample output for the role roles/iam.roleViewer is as follows. The permissions are listed in the "includedPermissions" field.
$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
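The comment on the Permissions field says that a service account receives the union of the permissions from both fields. The Python sketch below illustrates that union using the roles/iam.roleViewer output above; the lookup table stands in for `gcloud iam roles describe`, and the function name and the extra `compute.instances.get` entry are assumptions chosen for the example.

```python
from typing import Dict, List

# Permissions per predefined role, copied from the sample output above.
ROLE_PERMISSIONS: Dict[str, List[str]] = {
    "roles/iam.roleViewer": [
        "iam.roles.get",
        "iam.roles.list",
        "resourcemanager.projects.get",
        "resourcemanager.projects.getIamPolicy",
    ],
}


def effective_permissions(
    predefined_roles: List[str],
    permissions: List[str],
    role_permissions: Dict[str, List[str]],
) -> List[str]:
    """Union of explicit permissions and those granted by predefined roles."""
    result = set(permissions)
    for role in predefined_roles:
        result.update(role_permissions[role])
    return sorted(result)


# Usage: one predefined role plus two explicit permissions (one overlaps).
print(effective_permissions(
    ["roles/iam.roleViewer"],
    ["iam.roles.get", "compute.instances.get"],
    ROLE_PERMISSIONS,
))
```

Comparing this union against the permissions a component actually uses is one way to spot predefined roles that grant more than is needed.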
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cloud Controller Manager Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
Evaluate if any of the GCP predefined roles in the credentials request manifest of machine api operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster CAPI Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Image Registry Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
These are phase 2 items from CCO-188
Moving items from other teams that need to be committed to in 4.13 for this work to complete.
Evaluate if any of the GCP predefined roles in the credentials request manifest of Cluster Storage Operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
The new GCP provider spec for credentials request CR is as follows:
type GCPProviderSpec struct { metav1.TypeMeta `json:",inline"` // PredefinedRoles is the list of GCP pre-defined roles // that the CredentialsRequest requires. PredefinedRoles []string `json:"predefinedRoles"` // Permissions is the list of GCP permissions required to // create a more fine-grained custom role to satisfy the // CredentialsRequest. // When both Permissions and PredefinedRoles are specified // service account will have union of permissions from // both the fields Permissions []string `json:"permissions"` // SkipServiceCheck can be set to true to skip the check whether the requested roles or permissions // have the necessary services enabled // +optional SkipServiceCheck bool `json:"skipServiceCheck,omitempty"` }
we can use the following command to check permissions associated with a GCP predefined role
gcloud iam roles describe <role_name>
The sample output for role roleViewer is as follows. The permission are listed in "includedPermissions" field.
[akhilrane@localhost cloud-credential-operator]$ gcloud iam roles describe roles/iam.roleViewer
description: Read access to all custom roles in the project.
etag: AA==
includedPermissions:
- iam.roles.get
- iam.roles.list
- resourcemanager.projects.get
- resourcemanager.projects.getIamPolicy
name: roles/iam.roleViewer
stage: GA
title: Role Viewer
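As a rough illustration of the replacement described above, the following Go sketch drops a predefined role from a provider spec and lists explicit permissions in its place. The `gcpProviderSpec` type is a local, simplified copy of the struct shown earlier (the embedded `metav1.TypeMeta` is omitted), `replaceRole` is a hypothetical helper, and the choice of which permissions are actually required is an assumption made only for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// gcpProviderSpec is a simplified local copy of GCPProviderSpec; the embedded
// metav1.TypeMeta is left out so the sketch has no external dependencies.
type gcpProviderSpec struct {
	PredefinedRoles  []string `json:"predefinedRoles"`
	Permissions      []string `json:"permissions"`
	SkipServiceCheck bool     `json:"skipServiceCheck,omitempty"`
}

// replaceRole (hypothetical helper) removes one predefined role from the spec
// and appends the permissions that should be granted instead.
func replaceRole(spec gcpProviderSpec, role string, perms []string) gcpProviderSpec {
	out := gcpProviderSpec{
		PredefinedRoles:  []string{},
		SkipServiceCheck: spec.SkipServiceCheck,
	}
	for _, r := range spec.PredefinedRoles {
		if r != role {
			out.PredefinedRoles = append(out.PredefinedRoles, r)
		}
	}
	out.Permissions = append(append([]string{}, spec.Permissions...), perms...)
	return out
}

func main() {
	before := gcpProviderSpec{PredefinedRoles: []string{"roles/iam.roleViewer"}}
	// Assumption: only these two permissions of the role are really needed.
	after := replaceRole(before, "roles/iam.roleViewer",
		[]string{"iam.roles.get", "iam.roles.list"})

	b, err := json.MarshalIndent(after, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

The permission names come from the "includedPermissions" field of the `gcloud iam roles describe` output; a real manifest change would keep only the subset the operator needs.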
Evaluate if any of the GCP predefined roles in the credentials request manifest of cloud credentials operator give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
Rather than create custom roles per-cluster, as is currently implemented for GCP, ccoctl should create custom roles per-project due to custom role deletion policies. When a custom role is deleted in GCP it continues to exist and contributes to quota for 7 days. Custom roles are not permanently deleted for up to 14 days after deletion ref: https://cloud.google.com/iam/docs/creating-custom-roles#deleting-custom-role.
Deletion should ignore these per-project custom roles by default and provide an optional flag to delete them.
Since the custom roles must be created per-project, deltas in permissions must be additive. We can't remove permissions with these restrictions since previous versions may rely on those custom role permissions.
Post a warning/info message about the permission delta so that users are aware of the extra permissions and can clean them up if they are sure the permissions are not being used.
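The additive rule above can be sketched in Go as follows. This is only an illustration under stated assumptions: `mergeAdditive` is a hypothetical helper, not ccoctl code, and the permission names in `main` are sample values. It keeps every permission already present on the per-project custom role, adds the newly requested ones, and reports the permissions that are present but no longer requested so a warning can be printed.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeAdditive (hypothetical helper) returns the union of the existing and
// wanted permissions, plus the "extra" permissions that exist on the role
// today but are not in the wanted list. Nothing is ever removed.
func mergeAdditive(existing, wanted []string) (merged, extra []string) {
	want := make(map[string]bool, len(wanted))
	for _, p := range wanted {
		want[p] = true
	}
	seen := make(map[string]bool)
	for _, p := range existing {
		if seen[p] {
			continue
		}
		seen[p] = true
		merged = append(merged, p)
		if !want[p] {
			extra = append(extra, p)
		}
	}
	for _, p := range wanted {
		if seen[p] {
			continue
		}
		seen[p] = true
		merged = append(merged, p)
	}
	sort.Strings(merged)
	sort.Strings(extra)
	return merged, extra
}

func main() {
	existing := []string{"iam.roles.get", "compute.disks.create"}
	wanted := []string{"iam.roles.get", "iam.roles.list"}

	merged, extra := mergeAdditive(existing, wanted)
	fmt.Println("permissions to set on the role:", merged)
	if len(extra) > 0 {
		fmt.Println("warning: role keeps permissions that are no longer requested:", extra)
	}
}
```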
Support serving OpenShift release signatures via Cincinnati. This mostly serves the disconnected use case.
Currently for disconnected OCP image mirroring we need to create and configure a configmap as mentioned here
Connected/disconnected Cincinnati can mirror signatures from their upstream locations without creating configmap using oc-mirror command.
Also, load signatures from a graph-data container image, for the restricted/disconnected-network case.
When mirroring images for a disconnected installation with the "oc-mirror" command, the signature files located in the release-signatures folder are currently missing. Today the files are applied manually to the "openshift-config-managed" namespace. Without this manual step, any cluster trying to upgrade fails with an error that the versions are not signed/verified.
Serving OpenShift release signatures via Cincinnati would allow us to have a single service for update related metadata, namely a Cincinnati deployment on the local network, which the CVO will be configured to poll. This would make restricted/disconnected-network updates easier, by reducing the amount of pre-update cluster adjustments (no more need to create signature ConfigMaps in each cluster being updated).
Connected Cincinnati can mirror signatures from their upstream locations.
Cincinnati can also be taught to load signatures from a graph-data container image, for the restricted/disconnected-network case.
Update documentation to remove the need for configmaps
This impacts oc mirror. There are 2 ways to mirror images as mentioned here.
Epic Goal*
Serving OpenShift release signatures via Cincinnati. This is a follow-up to the OTA-908 epic. OTA-908 focuses on the disconnected use case, and this epic focuses on getting the same feature working in the hosted OSUS.
Why is this important? (mandatory)
Serving OpenShift release signatures via Cincinnati would allow us to have a single service for update related metadata, namely a Cincinnati deployment on the local network, which the CVO will be configured to poll. This would make restricted/disconnected-network updates easier, by reducing the amount of pre-update cluster adjustments (no more need to create signature ConfigMaps in each cluster being updated).
Scenarios (mandatory)
Current customer workflows like signature ConfigMap creation and ConfigMap application would no longer be required. Instead, cluster-version operators in restricted/disconnected-networks could fetch the signature data from the local OpenShift Update Service (Cincinnati).
Dependencies (internal and external) (mandatory)
The oc-mirror team will need to review/approve oc-mirror changes to take advantage of the new functionality for customers using oc-mirror workflows.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Bringing in and implementing the new spec property from OTA-916 (or, if OTA-916 ends up settling on "magically construct a signature store from upstream", implementing that).
This might be helpful context for constructing composite signature stores. Existing CVO-side integration is here, and that would get a bit more complicated with the need to dynamically manage spec-configured signature stores.
As an Infrastructure Administrator, I want to deploy OpenShift on Nutanix distributing the control plane and compute nodes across multiple regions and zones, forming different failure domains.
As an Infrastructure Administrator, I want to configure an existing OpenShift cluster to distribute the nodes across regions and zones, forming different failure domains.
Install OpenShift on Nutanix using IPI / UPI in multiple regions and zones.
This implementation would follow the same idea that has been done for vSphere. The following are the main PRs for vSphere:
https://github.com/openshift/enhancements/blob/master/enhancements/installer/vsphere-ipi-zonal.md
Nutanix Zonal: Multiple regions and zones support for Nutanix IPI and Assisted Installer
Note
As a user, I want to be able to spread control plane nodes for an OCP clusters across Prism Elements (zones).
This is a followup to https://issues.redhat.com/browse/OPNET-13. In that epic we implemented limited support for dual stack on VSphere, but due to limitations in upstream Kubernetes we were not able to support all of the use cases we do on baremetal. This epic is to track our work up and downstream to finish the dual stack implementation.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
This work will require updates to the core OpenShift API repository to add the new platform type, and then a distribution of this change to all components that use the platform type information. For components that partners might replace, per-component action will need to be taken, with the project team's guidance, to ensure that the component properly handles the "External" platform. These changes will look slightly different for each component.
To integrate these changes more easily into OpenShift, it is possible to take a multi-phase approach which could be spread over a release boundary (e.g., phase 1 is done in 4.X, phase 2 is done in 4.X+1).
OCPBU-5: Phase 1
OCPBU-510: Phase 2
OCPBU-329: Phase.Next
Phase 1
Phase 2
Phase 3
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
In the context of the Machine Config Operator (MCO) in Red Hat OpenShift, on-cluster builds refer to the process of building an OS image directly on the OpenShift cluster, rather than building them outside the cluster (such as on a local machine or continuous integration (CI) pipeline) and then making a configuration change so that the cluster uses them. By doing this, we enable cluster administrators to have more control over the contents and configuration of their clusters’ OS image through a familiar interface (MachineConfigs and in the future, Dockerfiles).
Support Platform external to allow installing with agent on OCI, with focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem.
OCPSTRAT-510 OpenShift on Oracle Cloud Infrastructure (OCI) with VMs
Support Platform external to allow installing with agent on OCI, with focus on https://www.oracle.com/cloud/cloud-at-customer/dedicated-region/faq/ for disconnected, on-prem
Review the OVN Interconnect proposal, figure out the work that needs to be done in ovn-kubernetes to be able to move to this new OVN architecture.
OVN IC will be the model used in Hypershift.
This is nice but requires a lot of effort from the customer to create such a plugin. It would therefore be helpful to provide templates for such dynamic plugins to help customers get started. The templates could cover common use cases, such as a page that allows viewing customer-specific metric dashboards.
This would be similar to what is available in the OpenShift 4 console under Observe with the dashboards, but with the ability to specify queries based on customer-specific metrics and then define the type of graph.
With the given template, it would be easier for customers to start adopting that functionality and continue building additional functionality if and where needed.
Dynamic plugins are great but require some effort to get started for anyone who is not very familiar with the SDK. Providing templates for certain use cases would therefore help customers, and even partners, build knowledge faster and get started with the dynamic plugin functionality.
Currently there are three cases where the metrics are being displayed in the console:
For each of them we should have an example in the Metrics Template.
This document contains all the knowledge on the metrics use-cases and their requirements.
Visualising custom metrics is a common ask and therefore a potential good example to create such a template.
Since `QueryBrowser` component was exposed in the console-dynamic-plugin-sdk we should showcase its usage.
One of the use cases should be to render a chart, using the `QueryBrowser` component, in a dashboard card. For that we would use the `console.dashboard/card` console extension.
AC:
As a cluster-admin I want to see conditional update path for releases on HCP. I want HCP to consume update recommendations from OpenShift Update Service (OSUS) similar to self-managed OCP clusters. And CVO of a hosted cluster should be able to evaluate conditional updates similar to self-managed OCP clusters.
Background.
Follow up from OTA-791
At a high level, HCP updates should be handled by the CVO, and features like OSUS and conditional updates should also be available in HCP. This Feature is a continuation of the work the OTA team is doing around CVO and HCP.
Epic Goal*
The Goal of this Epic is to:
As part of the first phase, evaluation of conditional updates containing PromQL risks is implemented for self-managed HyperShift deployed on an OpenShift management cluster.
Once the CVO has configurable knobs for its PromQL engine (OTA-854), teach the hosted-control-plane operator's CVO controller to set those knobs to point them at the management-cluster's Thanos.
Definition of Done:
* teach the hosted-control-plane operator's CVO controller to set those knobs to point them at the management-cluster's Thanos.
HyperShift is considering allowing hosted clusters to avoid having a monitoring stack (ADR-30, OBSDA-242). Platform metrics for the hosted clusters contain the data we've used so far for conditional update risks, and those will be scraped by Prometheus on the management cluster, retained for >=24h, and remote-written to a Red Hat Observatorium cluster. Current conditional update risks have never gone beyond max_over_time(...[1h]) to smear over pod restarts and such, so local Prometheus/Thanos should have plenty of data for us.
This ticket is about adding knobs to the CVO so HyperShift can point it at that management-cluster Thanos.
Follow-up work will teach HyperShift to tune those knobs when creating the CVO deployment.
We need to enhance cluster network operator to automate the whole SDN live-migration.
During the migration, a node will start as an SDN node (a hybrid overlay node from the OVN-K perspective), then become an OVN-K node. So OVN-K needs to support such dynamic role switching.
Task to track post-merge testing for this epic
Goal:
Enable and support Multus CNI for MicroShift.
Background:
Customers with advanced networking requirements need to be able to attach additional networks to a pod, e.g. for high-performance requirements using SR-IOV or complex VLAN setups.
Requirements:
Documentation:
Testing:
Customer Considerations:
Out of scope:
(contacting ART to setup image build is another task)
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This document describes the expectations for promoting a feature that is behind a feature gate.
The criteria includes:
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) with VMs
Currently, we don't yet support OpenShift 4 on Oracle Cloud Infrastructure (OCI), and we know from initial attempts that installing OpenShift on OCI requires the use of a qcow (OpenStack qcow seems to work fine), networking and routing changes, storage issues, potential MTU and registry issues, etc.
TBD based on customer demand.
Why is this important
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
RFEs:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Other
Allow users to upload multi-document YAML files.
Users must be able to upload multi-document YAML files through the UI and API.
Yes, since users might not be aware they can use this method in order to upload many
The backend (BE) API allows multi-document YAML files to be uploaded as part of custom manifests, but it looks like the OpenShift installer doesn't handle multi-document YAML.
A solution could be to split the multi-document YAML file into several YAML files.
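A minimal Go sketch of the proposed split is shown below. It is not the assisted-service implementation: `splitYAMLDocuments` is a hypothetical helper, the output file names are made up, and the splitting is deliberately naive. It treats any line consisting solely of `---` as a document separator, so it does not handle separators followed by content on the same line or a `---` line inside a block scalar.

```go
package main

import (
	"fmt"
	"strings"
)

// splitYAMLDocuments (hypothetical helper) splits a multi-document YAML
// string on lines that contain only the "---" marker and drops empty
// documents. Each returned document ends with a single newline.
func splitYAMLDocuments(content string) []string {
	var docs []string
	var cur []string

	flush := func() {
		doc := strings.Join(cur, "\n")
		if strings.TrimSpace(doc) != "" {
			docs = append(docs, strings.Trim(doc, "\n")+"\n")
		}
		cur = nil
	}

	for _, line := range strings.Split(content, "\n") {
		if strings.TrimRight(line, " \t\r") == "---" {
			flush()
			continue
		}
		cur = append(cur, line)
	}
	flush()
	return docs
}

func main() {
	manifest := "kind: ConfigMap\nmetadata:\n  name: a\n---\nkind: ConfigMap\nmetadata:\n  name: b\n"
	for i, doc := range splitYAMLDocuments(manifest) {
		// The file naming scheme here is only an illustration.
		fmt.Printf("# manifest-%d.yaml\n%s", i, doc)
	}
}
```

A production version would more likely rely on a YAML parser that understands document boundaries rather than on line matching.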
Based on the work to enable OpenShift core platform components to support Azure Identity (captured in OCPBU-8), a standardized flow exists for OLM-managed operators to interact with the cluster in a specific way to leverage Azure Identity for authorization when using Azure APIs using short-lived tokens as opposed to insecure static, long-lived credentials. OLM-managed operators can implement integration with the CloudCredentialOperator in a well-defined way to support this flow.
Enable customers to easily leverage OpenShift's capabilities around Azure Identity for short-lived authentication tokens with layered products, for increased security posture. Enable OLM-managed operators to implement support for this in a well-defined pattern.
See Operators & STS slide deck. It refers to AWS STS as an example, but conceptually the same use case and workflow applies to Azure identity.
The CloudCredentialsOperator already provides a powerful API for OpenShift's cluster core operators to request credentials and acquire them via short-lived tokens. This capability should be expanded to OLM-managed operators, specifically to Red Hat layered products that interact with Azure APIs. The process today ranges from cumbersome to non-existent depending on the operator in question and is seen as an adoption blocker for OpenShift on Azure.
This is particularly important for ARO customers. Customers are expected to be asked to pre-create the required IAM roles outside of OpenShift, which is deemed acceptable.
See overview, goals, requirements, use cases, scope and background, etc as per https://issues.redhat.com/browse/OCPBU-560
Plan is to backport these changes to 4.14 so that they can be used on HCP / ROSA
Summary:
Similar to work done for AWS STS (https://issues.redhat.com/browse/CCO-286), enable in CCO a new workflow (see the EP PR at https://github.com/openshift/enhancements/pull/1339) to detect when temporary authentication tokens (TAT) (workload identity credential) are in use on a cluster.
Important details:
Detection that workload identity credentials (TAT) are in use will mean CCO is in Manual mode and the Service Account has a non-empty Service Account Issuer field.
This workflow will be triggered by additions to the CredentialsRequest for an operator:
spec.cloudTokenString
spec.cloudTokenPath
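The detection rule and trigger described above can be expressed as a small Go sketch. All type, field and function names here are hypothetical stand-ins for illustration, not CCO code, and treating "at least one of the two fields is set" as the trigger is an assumption.

```go
package main

import "fmt"

// clusterState holds the two inputs of the detection rule (hypothetical type).
type clusterState struct {
	CredentialsMode      string // CCO mode, e.g. "Manual"
	ServiceAccountIssuer string // empty when no issuer is configured
}

// credentialsRequest carries only the two spec fields named in the text
// (hypothetical, simplified type).
type credentialsRequest struct {
	CloudTokenString string
	CloudTokenPath   string
}

// usesTemporaryTokens reports whether workload identity credentials (TAT)
// are considered in use: Manual mode plus a non-empty issuer.
func usesTemporaryTokens(s clusterState) bool {
	return s.CredentialsMode == "Manual" && s.ServiceAccountIssuer != ""
}

// shouldCreateSecret assumes the workflow is triggered when TAT is in use
// and the request sets at least one of the two token fields.
func shouldCreateSecret(s clusterState, cr credentialsRequest) bool {
	return usesTemporaryTokens(s) &&
		(cr.CloudTokenString != "" || cr.CloudTokenPath != "")
}

func main() {
	s := clusterState{CredentialsMode: "Manual", ServiceAccountIssuer: "https://issuer.example"}
	cr := credentialsRequest{CloudTokenPath: "/var/run/secrets/token"}
	fmt.Println("TAT in use:", usesTemporaryTokens(s))
	fmt.Println("create secret:", shouldCreateSecret(s, cr))
}
```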
Acceptance Criteria:
When OCP is running on an Azure platform with temporary authentication tokens enabled, CCO will detect this and, in the presence of a properly annotated CredentialsRequest, create a Secret to allow Azure SDK calls for Azure resources to succeed.
Console enhancements based on customer RFEs that improve customer user experience.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
According to security best practice, it's recommended to set readOnlyRootFilesystem: true for all containers running on Kubernetes. Given that openshift-console does not set that explicitly, it's requested that this be evaluated and, if possible, set to readOnlyRootFilesystem: true, or otherwise to readOnlyRootFilesystem: false with an explanation of why the filesystem needs to be writable.
3. Why does the customer need this? (List the business requirements here)
Extensive security audits are run on OpenShift Container Platform 4 and are highlighting that many vendor-specific containers fail to set readOnlyRootFilesystem: true or to justify why readOnlyRootFilesystem: false is set.
AC: Set the readOnlyRootFilesystem field on both the console and console-operator deployments' specs. Part of the work is to determine the value: true if the pod does not write to its filesystem, otherwise false.
1. Proposed title of this feature request
Add option to enable/disable tailing to log viewer
2. What is the nature and description of the request?
See https://issues.redhat.com/browse/OCPBUGS-362
3. Why does the customer need this?
See https://issues.redhat.com/browse/OCPBUGS-362
4. List any affected packages or components.
Management Console
AC: Add functionality for tailing logs based on UX input. This functionality should be done for both Pod logs view. Add integration test.
UX draft - https://docs.google.com/document/u/2/d/1C9lO4JvUesAIn9U5m7Q98Tx4zw77sQGtzCJ4Wh6FAK0/edit?usp=sharing
UX contact - Tal Tobias
Show node uptime information in the OpenShift console.
When the user logs into the OpenShift web console and goes to the Nodes section, it doesn't display the uptime information of each node. Currently, it only shows the date when the node was created.
Customer wants to have additional info related to the time a node is up, i.e. since when the node is up, so it can be useful for tracking node restarts or failures.
Unify and update hosted control planes storage operators so that they have similar code patterns and can run properly in both standalone OCP and HyperShift's control plane.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. Initial completion during Refinement status.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Epic Goal*
Our current design of EBS driver operator to support Hypershift does not scale well to other drivers. Existing design will lead to more code duplication between driver operators and possibility of errors.
Why is this important? (mandatory)
An improved design will allow more storage drivers and their operators to be added to hypershift without requiring significant changes in the code internals.
Scenarios (mandatory)
Finally switch both CI and ART to the refactored aws-ebs-csi-driver-operator.
The functionality and behavior should be the same as the existing operator, however, the code is completely new. There could be some rough edges. See https://github.com/openshift/enhancements/blob/master/enhancements/storage/csi-driver-operator-merge.md
CI should catch the most obvious errors, however, we need to test features that we do not have in CI. Like:
Our CSI driver YAML files are mostly copy-paste from the initial CSI driver (AWS EBS?).
As an OCP engineer, I want the YAML files to be generated, so we can keep consistency among the CSI drivers easily and make them less error-prone.
It should have no visible impact on the resulting operator behavior.
Goal
Guided installation user experience that interacts via prompts for necessary inputs, informs of erroneous/invalid inputs, and provides status and feedback throughout the installation workflow with very few steps, that works for disconnected, on-premises environments.
Installation is performed from a bootable image that doesn't contain cluster details or user details, since these details will be collected during the installation flow after booting the image in the target nodes.
This means that the image is generic and can be used to install an OpenShift cluster in any supported environment.
Why is this important?
Customers/partners desire a guided installation experience to deploy OpenShift with a UI that includes support for disconnected, on-premises environments, and which is as flexible in terms of configuration as UPI.
We have partners that need to provide an installation image that can be used to install new clusters on any location and for any users, since their business is to sell the hardware along with OpenShift, where OpenShift needs to be installable in the destination premises.
Acceptance Criteria
This should provide an experience closely matching the current hosted service (Assisted Installer), with the exception that it is limited to a single cluster because the host running the service will reboot and become a node in the cluster as part of the deployment process.
Dependencies
Modify the cluster registration code in the assisted-service client (used by create-cluster-and-infraenv.service) to allow creating the cluster given only the following config manifests:
If the following manifests are present, data from them should be used:
Other manifests (ClusterDeployment, AgentClusterInstall) will not be present in an interactive install, and the information therein will be entered via the GUI instead.
A CLI flag or environment variable can be used to select the interactive mode.
create-cluster-and-infraenv.service will be split into agent-register-cluster.service and agent-register-infraenv.service.
Any existing systemd service dependency on create-cluster-and-infraenv.service should be moved to agent-register-infraenv.service.
Acceptance Criteria:
Upstream Kubernetes is following other SIGs by moving its in-tree cloud providers to an out-of-tree plugin format, Cloud Controller Manager, at some point in a future Kubernetes release. OpenShift needs to be ready to action this change.
GA of the cloud controller manager for the GCP platform
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
To make the CCM GA, we need to update the switch case in library-go to make sure the GCP CCM is always considered external.
We then need to update the vendor in KCMO, CCMO, KASO and MCO.
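The shape of that switch change can be sketched as follows. This is not the library-go code: the function name, the platform type and the fallback to a feature gate for other platforms are all assumptions made for illustration. The only point it shows is that the GCP case returns true unconditionally.

```go
package main

import "fmt"

// platformType is a hypothetical stand-in for the platform type used by the
// real switch statement.
type platformType string

const (
	platformGCP   platformType = "GCP"
	platformOther platformType = "Other" // placeholder for any non-GA platform
)

// isCloudProviderExternal (hypothetical) decides whether the external cloud
// controller manager is used for a platform.
func isCloudProviderExternal(p platformType, featureGateEnabled bool) bool {
	switch p {
	case platformGCP:
		// GA: GCP is always considered external, regardless of feature gates.
		return true
	default:
		// Assumption: other platforms still follow the feature gate.
		return featureGateEnabled
	}
}

func main() {
	fmt.Println("GCP, gate off:", isCloudProviderExternal(platformGCP, false))
	fmt.Println("Other, gate off:", isCloudProviderExternal(platformOther, false))
}
```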
Track goals/requirements for self-managed GA of Hosted control planes on BM using the agent provider. Mainly make sure:
This Section:
Customers are looking at HyperShift to deploy self-managed clusters on Baremetal. We have positioned the Agent flow as the way to get BM clusters due to its ease of use (it automates many of the rather mundane tasks required to set up BM clusters), and it's planned for GA with MCE 2.3 (in the OCP 4.13 timeframe).
Questions to be addressed:
To run a HyperShift management cluster in disconnected mode we need to document which images need to be mirrored and potentially modify the images we use for OLM catalogs.
ICSP mapping only happens for image references with a digest, not a regular tag. We need to address this for images we reference by tag:
CAPI, CAPI provider, OLM catalogs
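As a rough illustration of the digest-versus-tag distinction above, the following sketch flags image references that are not pinned by digest and would therefore not be covered by ICSP mirroring. The function names and the sample image references are hypothetical; the only assumption about the reference format is that a digest-pinned image ends in `@<algorithm>:<hex>`.

```python
import re

# A digest-pinned reference ends in "@<algorithm>:<hex digest>",
# for example "@sha256:" followed by 64 hex characters.
_DIGEST_RE = re.compile(r"@[a-z0-9]+:[0-9a-f]{32,}$")


def is_digest_reference(image: str) -> bool:
    """Return True when the image reference is pinned by digest."""
    return bool(_DIGEST_RE.search(image))


def images_needing_attention(images):
    """Return the references that are tag-based (or untagged) and so would be skipped by ICSP."""
    return [img for img in images if not is_digest_reference(img)]
```

Running `images_needing_attention` over the list of images used by CAPI, the CAPI provider and the OLM catalogs would give a first inventory of references that need to be handled separately.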
Group all tasks for CAPI-provider-agent GA readiness
no
Feature origin (who asked for this feature?)
Enable support to bring your own encryption key (BYOK) for OpenShift on IBM Cloud VPC.
As a user I want to be able to provide my own encryption key when deploying OpenShift on IBM Cloud VPC so the cluster infrastructure objects, VM instances and storage objects, can use that user-managed key to encrypt the information.
The Installer will provide a mechanism to specify a user-managed key that will be used to encrypt the data on the virtual machines that are part of the OpenShift cluster as well as any other persistent storage managed by the platform via Storage Classes.
This feature is a required component for IBM's OpenShift replatforming effort.
The feature will be documented as usual to guide the user while using their own key to encrypt the data on the OpenShift cluster running on IBM Cloud VPC
Epic Goal*
Review and support the IBM engineering team while enabling BYOK support for OpenShift on IBM Cloud VPC
Why is this important? (mandatory)
As part of the replatforming work IBM is doing for its managed OpenShift service, this feature is key.
Scenarios (mandatory)
All the cluster storage objects (VM storage and storage managed by StorageClass objects defined in the platform) will be encrypted using the user-managed key provided in the installation manifest.
Dependencies (internal and external) (mandatory)
https://issues.redhat.com/browse/CORS-2694
Contributing Teams(and contacts) (mandatory)
The IBM development team will be responsible for developing this feature, and the RH engineering team will review and support their work.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
As a user, I am able to provide my own key for boot volume encryption when deploying OpenShift on IBM Cloud VPC.
Provide encryption key to infrastructure for provisioning of control plane nodes.
Add support to the OpenShift Installer to specify a custom MTU to be used for the cluster network
As a user, I need to be able to provide a custom MTU to be consumed by the OpenShift Installer so I can change the cluster network MTU on day-0
The OpenShift Installer will accept another parameter through the install-config manifest that will be used as an option to change the default MTU value for the cluster network
The target customers run OpenShift in environments where network traffic between the cluster and external endpoints is limited to a certain MTU. Examples are OpenShift clusters running on public clouds such as AWS, Azure or GCP that are connected to external services on the customer's premises via direct links, VPNs, etc., and are therefore limited to a certain MTU.
Additional background can be found in RFE-4009
As a new option will be added to the Installer manifest this will need to be documented as any other.
Configure vSphere integration with credentials specified in the install-config.yaml file used by the agent install ISO image so that the platform integration is configured on day-1 with the agent-based installer.
Stop ignoring any (optional) vSphere credential values provided in the install-config, and pass them instead to the cluster. This allows users to configure the credentials at day 1 if they want to, though it should remain optional. It also means that an install-config usable for IPI (where the credentials are required) should result in an equivalent cluster when the agent installer is used.
Currently there are warning messages logged about these values being ignored when provided. These warnings should be removed when the values are no longer ignored.
In the absence of direct API support for this in assisted, we should be able to use install-config overrides (which the agent installer is already able to make use of internally).
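The field's JSON attribute is named "user", but because the yaml.Marshal function is used, that JSON name is not honored and "user" is not interpreted as vSphere.vcenters.Username. The generated install-config then contains an unknown "username" field and lacks the required username value, which produces the errors shown in the log below.
The fix is to change yaml.Marshal to json.Marshal in GetInstallConfig in internal/installconfig/builder.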
time="2023-09-25T21:29:07Z" level=error msg="error running openshift-install create manifests, stdout: level=warning msg=failed to parse first occurrence of unknown field: failed to unmarshal install-config.yaml: error unmarshaling JSON: while decoding JSON: json: unknown field \"username\"\nlevel=info msg=Attempting to unmarshal while ignoring unknown keys because strict unmarshaling failed\nlevel=error msg=failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: [platform.vsphere.vcenters.username: Required value: must specify the username, platform.vsphere.failureDomains.server: Invalid value: \"vcenterplaceholder\": server does not exist in vcenters]\n" func="github.com/openshift/assisted-service/internal/ignition.(*installerGenerator).runCreateCommand" file="/src/internal/ignition/ignition.go:1688" cluster_id=aad05e8e-a9fe-4f60-b580-0b2f6c4fdf10 error="exit status 3" go-id=1569 request_id=
time="2023-09-25T21:29:07Z" level=error msg="failed generating install config for cluster aad05e8e-a9fe-4f60-b580-0b2f6c4fdf10" func="github.com/openshift/assisted-service/internal/bminventory.(*bareMetalInventory).generateClusterInstallConfig" file="/src/internal/bminventory/inventory.go:1755" cluster_id=aad05e8e-a9fe-4f60-b580-0b2f6c4fdf10 error="error running openshift-install manifests, level=warning msg=failed to parse first occurrence of unknown field: failed to unmarshal install-config.yaml: error unmarshaling JSON: while decoding JSON: json: unknown field \"username\"\nlevel=info msg=Attempting to unmarshal while ignoring unknown keys because strict unmarshaling failed\nlevel=error msg=failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: [platform.vsphere.vcenters.username: Required value: m <TRUNCATED>: exit status 3" go-id=1569 pkg=Inventory request_id=
The vSphere credentials need to be passed through to assisted-service. The AgentConfigInstall ZTP manifest has an annotation where the install-config override can be set.
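A minimal sketch of how such an override could be assembled is shown below. It is not the agent installer's implementation. The annotation key, the function name, and the `server` and `password` field names are assumptions made for illustration; only the `user` attribute name comes from the description above. Credentials stay optional: when none are supplied, no annotation is produced.

```python
import json

# Assumption: the annotation key that carries install-config overrides.
OVERRIDES_ANNOTATION = "agent-install.openshift.io/install-config-overrides"


def vsphere_override_annotation(vcenters):
    """Build the override annotation from optional vCenter credentials.

    Entries without both a user and a password are skipped, so an
    install-config without credentials yields no override at all.
    """
    entries = []
    for vc in vcenters:
        if not (vc.get("user") and vc.get("password")):
            continue
        entries.append({
            "server": vc["server"],
            "user": vc["user"],          # JSON attribute name "user", not "username"
            "password": vc["password"],
        })
    if not entries:
        return {}
    overrides = {"platform": {"vsphere": {"vcenters": entries}}}
    return {OVERRIDES_ANNOTATION: json.dumps(overrides)}
```

Serializing with a JSON encoder keeps the attribute name `user` intact, which is the property the fix described above relies on.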
Acceptance Criteria:
Use any values provided in the install-config (e.g. root device hints, network config, BMC details) as defaults for the agent-config for the baremetal platform when installing with the agent-based installer.
If the agent-config file specifies host-specific settings then these should override the install-config. This enables users to use the same config for both agent-based and IPI installation.
If the user has an IPI install-config complete with BMC credentials, pass them through to the cluster so that it will end up with BareMetalHosts that can be managed by MAPI just as they would after an IPI install, instead of then having to add the credentials again on day 2.
BMC credentials must remain optional in the install-config though.
We allow the user to specify per-host settings (e.g. root device hints, network config) in the agent-config file, on any platform.
However, if the platform is baremetal, there are also fields for this data in the platform section of the install-config. Currently any data specified here is ignored (with a warning about this logged).
We should use any values provided in the install-config as defaults for the agent-config. If the agent-config specifies host-specific settings then these should override the install-config. This enables users to use the same config for both agent-based and IPI installation.
As part of this work, the logs warning of unused values should be removed.
If the install-config.yaml contains baremetal host configuration fields, and no host fields are defined in agent-config.yaml, use the install-config settings to define the hosts. The following fields will be copied over from the bm host definition
https://github.com/openshift/installer/blob/master/pkg/types/baremetal/platform.go#L37
The install-config struct does not have an interfaces field that matches agent-config; instead, the bootMacAddress of the install-config host will be used to create the interfaces array.
The hardwareProfile and bootMode fields will not be copied.
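The fallback described above can be sketched as follows. This is an illustration under assumptions, not the installer's code: the exact set of copied fields is defined in the linked platform.go and is not reproduced here, so `COPIED_FIELDS`, the `bootMACAddress` key spelling, the `hostname` target key and the placeholder interface name `boot` are all hypothetical.

```python
from copy import deepcopy

# Assumption: host fields carried over unchanged. hardwareProfile and bootMode
# are deliberately absent, so they are never copied.
COPIED_FIELDS = ("role", "rootDeviceHints", "networkConfig")


def hosts_for_agent_config(install_config_hosts, agent_config_hosts):
    """Return the host list the agent-based installer should use.

    Hosts defined in agent-config always win. Only when agent-config defines
    no hosts are the install-config baremetal hosts used to build them.
    """
    if agent_config_hosts:
        return deepcopy(agent_config_hosts)

    hosts = []
    for ic_host in install_config_hosts:
        host = {"hostname": ic_host["name"]}
        for field in COPIED_FIELDS:
            if field in ic_host:
                host[field] = deepcopy(ic_host[field])
        # install-config has no interfaces list, so derive one from the boot MAC.
        mac = ic_host.get("bootMACAddress")
        if mac:
            host["interfaces"] = [{"name": "boot", "macAddress": mac}]
        hosts.append(host)
    return hosts
```

Keeping the agent-config hosts authoritative means the same install-config can be reused for IPI and agent-based installs without changing the outcome of an existing agent-config.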
Add the name of the template to the infrastructure failure domain generated by the installer.
The OpenShift API needs to be updated to define VSphereFailureDomain. A draft PR is here: https://github.com/openshift/api/pull/1539
Also, ensure that the client-go and openshift-cluster-config-operator projects are bumped once the API changes merge.
Nutanix is performing work in parallel, and we need to pull out the common bits into a non-vSphere-specific PR.
To align with the 4.15 release, dependencies need to be updated to 1.27. This should be done by rebasing/updating as appropriate for the repository
Traditionally we did these updates as bugfixes, because we did them after the feature freeze (FF).
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
(Using separate cards for each driver because these updates can be more complicated)
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
EOL, do not upgrade:
Update all CSI sidecars to the latest upstream release from https://github.com/orgs/kubernetes-csi/repositories
Corresponding downstream repos have `csi-` prefix, e.g. github.com/openshift/csi-external-attacher.
This includes updating the VolumeSnapshot CRDs in the cluster-csi-snapshot-controller-operator assets and the client API in go.mod, i.e. copy all snapshot CRDs from upstream to the operator assets and run go get -u github.com/kubernetes-csi/external-snapshotter/client/v6 in the operator repo.
Update the driver to the latest upstream release. Notify QE and docs with any new features and important bugfixes that need testing or documentation.
This includes ibm-vpc-node-label-updater!
(Using separate cards for each driver because these updates can be more complicated)
To align with the 4.15 release, dependencies need to be updated to 1.28. This should be done by rebasing/updating as appropriate for the repository
Support OpenShift deployments on Azure to configure public and private exposure for OpenShift API and OpenShift Ingress separately at installation time
The goal is to reconcile the difference between the publishing strategy the Installer provides on Azure and what ARO offers, by upstreaming the capability to set public or private exposure separately for the API server and the Ingress component.
The user should be able to provide public or private publish configuration at installation time for the API and Ingress components.
The initial use case will be for ARO customers as explained in ARO-2803
ARO currently offers the ability to specify APIserver and ingress visibility at install time. You can set either to public or private, and they can differ (i.e. public ingress, private apiserver).
OpenShift currently does not have this feature natively, it must be done either day2 or via some other mechanism. Based on what you set the value of "publish" to in your install config (Internal | External) the components will or will not be internet accessible.
This will require user facing documentation as any other option we currently document for OpenShift Installer
Add support for the Installer to encrypt the Storage Account used for OpenShift on Azure at installation time.
As a user I can instruct the Installer to encrypt the Storage Account created while deploying OpenShift on Azure for increased security.
The user is able to provide a Storage Account encryption ID to be used when the Installer creates the Storage Account for OpenShift on Azure.
Related work on disks encryption for Azure was delivered as part of OCPSTRAT-308 feature. Now we are extending this to the Storage Account.
Usual documentation will be required to explain how to use this option in the Installer
Terraform is used for storage account creation
Not encrypting the storage account used for bootstrap
Goal
Hardware RAID support on Dell, Supermicro and HPE with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there has been work at Fujitsu on configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3.
Dell, Supermicro and HPE, which are the most common hardware platforms we find in our customers' environments, are the main targets.
Goal
Hardware RAID support on Dell with Metal3.
Why is this important
Setting up RAID devices is a common operation in the hardware for OpenShift nodes. While there has been work at Fujitsu on configuring RAID in Fujitsu servers with Metal3, we don't support any generic interface with Redfish to extend this support and set it up in Metal3 for Dell, which is one of the most common hardware platforms we find in our customers' environments.
Before implementing generic support, we need to understand the implications of enabling an interface in Metal3 to allow it on multiple hardware types.
Scope questions
While rendering BMO in https://issues.redhat.com/browse/METAL-829, the node cpu_arch was hardcoded to x86_64.
We should use bmh.Spec.Architecture instead to be more future-proof.
CoreOS Layering allows users to derive customized disk images for cluster nodes and update their nodes to these custom images. However, OpenShift still uses pristine images for the first boot of any machine. Customers should be allowed to use customized firstboot RHCOS images for machine installation.
This is critical to the needs of many OEMs and those working with cutting edge hardware with rapidly developing drivers.
Investigate how we can enable customers to create their own boot images from a derived container.
One potential path is to work together with Image Builder team on what technology can be shared between our pipeline(s) and their building service.
Phase 1 focuses on building a RAW/QCOW2 FCOS/RHCOS disk image from a container image using osbuild and integrating with COSA / our pipeline.
In COSA, when we create a disk image we stamp in a .coreos-aleph-version.json file that we then use for various other things later on. We need an equivalent component when building disk images using OSBuild. Today the file is created using this code:
cat > $rootfs/.coreos-aleph-version.json << EOF
{
    "build": "${buildid}",
    "ref": "${ref}",
    "ostree-commit": "${commit}",
    "imgid": "${imgid}"
}
EOF
We can feed some of this information into OSBuild via the manifest file but it might be really nice to be able to detect it dynamically from the installed tree so that we can reduce the requirements on future users of this having to know the values.
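For reference, the same file can be produced outside a shell heredoc. The sketch below mirrors the four keys written by the COSA snippet above; the function names are hypothetical and the values are expected to be supplied by the caller (for example from a manifest), since detecting them dynamically from the installed tree is the open question.

```python
import json
from pathlib import Path


def aleph_version_json(buildid: str, ref: str, commit: str, imgid: str) -> str:
    """Render the aleph version document with the same keys COSA writes."""
    data = {
        "build": buildid,
        "ref": ref,
        "ostree-commit": commit,
        "imgid": imgid,
    }
    return json.dumps(data, indent=4) + "\n"


def write_aleph_version(rootfs: str, buildid: str, ref: str, commit: str, imgid: str) -> Path:
    """Write .coreos-aleph-version.json at the root of the given tree."""
    path = Path(rootfs) / ".coreos-aleph-version.json"
    path.write_text(aleph_version_json(buildid, ref, commit, imgid))
    return path
```

Because `imgid` is the only value that depends on the kind of image being built, it is kept as a separate argument so that a per-image-type pipeline could supply it.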
Also, since the `imgid` value depends on the kind of image being created, we might need a separate pipeline (i.e. `tree-qemu`) that adds the correct imgid and is then fed into the qcow2 assembler. We probably also need to do something similar for `ignition.platform.id=qemu` in the future.
GitHub Internal PR: https://github.com/dustymabe/osbuild/pull/10
Upstream PR to OSBuild: https://github.com/osbuild/osbuild/pull/1475
Support AWS Wavelength Zones as a target infrastructure on which to deploy OpenShift compute nodes.
As a user, I want to deploy OpenShift compute nodes on AWS Wavelength Zones at install time so I can leverage this infrastructure to deploy edge computing applications.
As a user, I want to extend an existing OpenShift cluster on AWS deploying compute nodes on AWS Wavelength Zones so I can leverage this infrastructure to deploy edge computing applications.
The Installer will be able to deploy OpenShift on the public region into an existing VPC with compute nodes on AWS Wavelength Zones into an existing subnet.
The Installer will be able to deploy OpenShift on the public region with compute nodes on AWS Wavelength Zones automating the VPC creation in the public region and the subnet creation in the AWS Wavelength Zone
An existing OpenShift cluster on AWS public region can be extended by adding additional compute nodes (that can be automatically scaled) into AWS Wavelength Zones.
Build media and entertainment applications.
Accelerate ML inference at the edge.
Develop connected vehicle applications.
There is extensive demand for running specific workloads at edge locations on cloud providers. We have added support for AWS Outposts and AWS Local Zones. AWS Wavelength Zones is a target infrastructure that customers, including ROSA customers, are asking for.
Usual documentation will be required to instruct the user on how to use this feature
As a user, I want to deploy OpenShift compute nodes on AWS Wavelength Zones at install time in public subnets so I can leverage this infrastructure to deploy edge computing applications.
As a user, I want to deploy OpenShift compute nodes on AWS Wavelength Zones in public subnets in existing clusters installed with edge nodes so I can leverage this infrastructure to deploy edge computing applications.
USER STORY:
Goal:
Wavelength Zones operate in the carrier network. To ingress traffic to instances running in those zones, a Carrier IP address must be assigned. The Carrier IP address is assigned to an instance when the network interface flag AssociateCarrierIpAddress is set to true at provisioning time.
PublicIP is the existing MachineSet flag that assigns a public IP address to nodes running in regular zones. The goal of this card is to teach the MAPI AWS provider to look at the zone type of the subnet: when the value is 'wavelength-zone', the flag AssociateCarrierIpAddress must be set to true instead of the default AssociatePublicIpAddress, allowing the EC2 service to assign a public IP address in the carrier network.
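The selection rule can be sketched as a small function. This is an illustration of the decision only, not the MAPI AWS provider's code: the function name is hypothetical, and the returned dictionary simply names the network interface flag that would be set on the instance request.

```python
WAVELENGTH_ZONE_TYPE = "wavelength-zone"


def network_interface_ip_flags(public_ip: bool, zone_type: str) -> dict:
    """Choose which IP association flag to set on the instance's network interface.

    public_ip is the MachineSet PublicIP setting; zone_type is the type of the
    zone that the target subnet belongs to.
    """
    if not public_ip:
        # No public addressing requested: set neither flag.
        return {}
    if zone_type == WAVELENGTH_ZONE_TYPE:
        # Subnets in Wavelength Zones need an address from the carrier network.
        return {"AssociateCarrierIpAddress": True}
    return {"AssociatePublicIpAddress": True}
```

With this rule, an existing MachineSet that sets PublicIP keeps its current behavior in regular zones and only changes behavior when its subnet is in a Wavelength Zone.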
Required:
ACCEPTANCE CRITERIA:
ENGINEERING DETAILS:
USER STORY:
DESCRIPTION:
AWS Wavelength Zones are infrastructures running in RAN (Radio Access Network) owned by a Carrier, outside the region. AWS provides a few services, including computing, network ingress traffic in the carrier network, and private network connectivity with the VPC in the region.
The Installer must be able to deploy OpenShift on the public region with compute (worker) nodes on AWS Wavelength Zones, automating the VPC creation in the public region and the subnet creation in the AWS Wavelength Zone.
The installer must create the private and public subnets, and the worker nodes must use the private subnet.
In the traditional deployment of OpenShift on AWS, the private subnet egresses traffic to the internet through a NAT Gateway. NAT Gateway is not currently supported in AWS Wavelength Zones. To remediate that, the deployment must follow the same strategy of associating the private subnets in Wavelength Zones with a route table in the region, preferably the route table for the WLZ's parent zone*.
*Every "edge zone" (Local and Wavelength Zone) is associated with one zone in the region, named the parent zone.
To ingress traffic to Wavelength Zone nodes, a public subnet must be created and associated with a route table that has a carrier gateway as its default route. The carrier gateway is similar to an internet gateway.
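The routing choices above can be summarized in a short sketch. It assumes the caller already knows each edge zone's parent zone and has a mapping from regional zone name to private route table ID; all function names, dictionary keys and IDs are hypothetical, and the fallback to a default regional table is one possible reading of "preferably".

```python
def private_route_table_for(edge_zone: dict, route_tables_by_zone: dict, default_table: str) -> str:
    """Pick the regional route table for a private subnet in an edge zone.

    Prefer the table of the zone's parent zone; otherwise fall back to a
    default regional table, since NAT Gateways are not available in the
    Wavelength Zone itself.
    """
    return route_tables_by_zone.get(edge_zone["parent_zone"], default_table)


def public_default_route(carrier_gateway_id: str) -> dict:
    """Describe the default route of a public Wavelength subnet's route table."""
    return {
        "DestinationCidrBlock": "0.0.0.0/0",
        "CarrierGatewayId": carrier_gateway_id,
    }
```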
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
Some OEM partners building RHDE platforms using MicroShift want to provide OLM to their customers with specially curated catalogs to allow end-user applications to depend on and install operators.
Extend the OpenShift on IBM Cloud integration with additional features to bring the capabilities offered for this provider on par with those available on other cloud platforms.
Extend the existing features while deploying OpenShift on IBM Cloud.
This top level feature is going to be used as a placeholder for the IBM team who is working on new features for this integration in an effort to keep in sync their existing internal backlog with the corresponding Features/Epics in Red Hat's Jira.
A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).
IBM Cloud VPC (x86_64) currently does not support Disconnected cluster installation via IPI.
In order to add this support, the override of certain IBM Cloud Services (e.g., IAM, IaaS), must be configurable and made available in the cluster infrastructure resource.
Implementation is dependent on API changes merging first (https://issues.redhat.com/browse/SPLAT-1097)
Installer validates and injects user provided endpoint overrides into cluster deployment process.
IBM dependent components of OCP will need to add support to use a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The MAPI component will need to allow all API calls to IBM Cloud Services to be directed to these endpoint values, in order to communicate in environments where the public or default IBM Cloud Service endpoint is not available.
The endpoint overrides are available via the infrastructure/cluster resource (.status.platformStatus.ibmcloud.serviceEndpoints), which is how a majority of components consume cluster-specific configuration (Ingress, MAPI, etc.). It will be structured as follows:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
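A component consuming these overrides could resolve an endpoint roughly as sketched below. The sketch operates on the Infrastructure resource already loaded into a plain dictionary; the function names and the default URL in the usage are hypothetical. Keeping the first entry when a service name is listed more than once is a choice made for this sketch, not documented behavior.

```python
def service_endpoint_overrides(infrastructure: dict) -> dict:
    """Map service name to override URL from .status.platformStatus.ibmcloud.serviceEndpoints."""
    platform_status = infrastructure.get("status", {}).get("platformStatus", {})
    endpoints = platform_status.get("ibmcloud", {}).get("serviceEndpoints") or []
    overrides = {}
    for endpoint in endpoints:
        # Sketch choice: the first entry for a given name wins.
        overrides.setdefault(endpoint["name"], endpoint["url"])
    return overrides


def endpoint_for(infrastructure: dict, service: str, default_url: str) -> str:
    """Return the override for a service, or the default endpoint when none is set."""
    return service_endpoint_overrides(infrastructure).get(service, default_url)
```

Falling back to the default URL keeps connected clusters, which carry no overrides, working unchanged.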
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
These changes have already landed in the release-1.28 branch (target OCP release-4.15 branch), but we need to make sure they get pulled into the github.com/openshift/cloud-provider-ibm branch and built into a 4.15 image.
The installer validates user-provided endpoint overrides and injects them into the cluster deployment process, and the MAPI components use the specified endpoints and start up properly.
A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).
IBM-dependent components of OCP need to support a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The Storage components need to direct all API calls to IBM Cloud Services to these endpoint values, so that they can communicate in environments where the public or default IBM Cloud Service endpoint is not available.
The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how most components consume cluster-specific configuration (Ingress, MAPI, etc.). It is structured as follows:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
The Storage component is reliant on the CCM cloud-conf configmap, but only the IAM, ResourceManager, and VPC endpoints are supplied, since that is all CCM uses. If additional IBM Cloud Services are used (e.g., COS, etc.), they will not be available in the CCM cloud-conf, but will always be in the infrastructure/cluster resource.
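One way to picture the relationship between the two sources is sketched below: the cloud-conf text is parsed for the three override keys, and the result only fills gaps in the endpoint map taken from infrastructure/cluster. The mapping from cloud-conf keys to service names and the precedence rule are assumptions for illustration, not the Storage component's actual logic.

```python
import configparser
from typing import Dict

# Shortened copy of the [provider] section shown above.
CLOUD_CONF = """\
[global]
version = 1.1.0
[provider]
region = eu-de
iamEndpointOverride = https://private.iam.cloud.ibm.com
g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
"""

# Hypothetical mapping from cloud-conf keys to infrastructure service names.
KEY_TO_SERVICE = {
    "iamEndpointOverride": "iam",
    "g2EndpointOverride": "vpc",
    "rmEndpointOverride": "resourcemanager",
}


def overrides_from_cloud_conf(text: str) -> Dict[str, str]:
    """Extract the endpoint overrides that the CCM cloud-conf carries."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the camelCase keys as written
    parser.read_string(text)
    if not parser.has_section("provider"):
        return {}
    provider = parser["provider"]
    return {svc: provider[key] for key, svc in KEY_TO_SERVICE.items() if key in provider}


def resolve_endpoints(cloud_conf: str, infra_endpoints: Dict[str, str]) -> Dict[str, str]:
    """Prefer infrastructure/cluster; use cloud-conf only for missing services."""
    resolved = dict(infra_endpoints)
    for service, url in overrides_from_cloud_conf(cloud_conf).items():
        resolved.setdefault(service, url)
    return resolved


cos_url = "https://s3.direct.us-east.cloud-object-storage.appdomain.cloud"
resolved = resolve_endpoints(CLOUD_CONF, {"cos": cos_url})
```

Here a COS endpoint, which the cloud-conf never contains, still reaches the caller because it comes from the infrastructure resource.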
The installer validates user-provided endpoint overrides and injects them into the cluster deployment process, and the storage components use the specified endpoints and start up properly.
A user currently is not able to create a Disconnected cluster, using IPI, on IBM Cloud.
Currently, support for BYON and Private clusters does exist on IBM Cloud, but support to override IBM Cloud Service endpoints does not exist, which is required to allow for Disconnected support to function (reach IBM Cloud private endpoints).
IBM-dependent components of OCP need to support a set of endpoint override values in order to reach IBM Cloud Services in Disconnected environments.
The Ingress Operator components need to direct all API calls to IBM Cloud Services to these endpoint values, so that they can communicate in environments where the public or default IBM Cloud Service endpoint is not available.
The endpoint overrides are available via the infrastructure/cluster (.status.platformStatus.ibmcloud.serviceEndpoints) resource, which is how most components consume cluster-specific configuration (Ingress, MAPI, etc.). It is structured as follows:
apiVersion: config.openshift.io/v1
kind: Infrastructure
metadata:
  creationTimestamp: "2023-10-04T22:02:15Z"
  generation: 1
  name: cluster
  resourceVersion: "430"
  uid: b923c3de-81fc-4a0e-9fdb-8c4c337fba08
spec:
  cloudConfig:
    key: config
    name: cloud-provider-config
  platformSpec:
    type: IBMCloud
status:
  apiServerInternalURI: https://api-int.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  apiServerURL: https://api.us-east-disconnect-21.ipi-cjschaef-dns.com:6443
  controlPlaneTopology: HighlyAvailable
  cpuPartitioning: None
  etcdDiscoveryDomain: ""
  infrastructureName: us-east-disconnect-21-gtbwd
  infrastructureTopology: HighlyAvailable
  platform: IBMCloud
  platformStatus:
    ibmcloud:
      dnsInstanceCRN: 'crn:v1:bluemix:public:dns-svcs:global:a/fa4fd9fa0695c007d1fdcb69a982868c:f00ac00e-75c2-4774-a5da-44b2183e31f7::'
      location: us-east
      providerType: VPC
      resourceGroupName: us-east-disconnect-21-gtbwd
      serviceEndpoints:
      - name: iam
        url: https://private.us-east.iam.cloud.ibm.com
      - name: vpc
        url: https://us-east.private.iaas.cloud.ibm.com/v1
      - name: resourcecontroller
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: resourcemanager
        url: https://private.us-east.resource-controller.cloud.ibm.com
      - name: cis
        url: https://api.private.cis.cloud.ibm.com
      - name: dnsservices
        url: https://api.private.dns-svcs.cloud.ibm.com/v1
      - name: cis
        url: https://s3.direct.us-east.cloud-object-storage.appdomain.cloud
    type: IBMCloud
The CCM is currently relying on updates to the openshift-cloud-controller-manager/cloud-conf configmap, in order to override its required IBM Cloud Service endpoints, such as:
data:
  config: |+
    [global]
    version = 1.1.0
    [kubernetes]
    config-file = ""
    [provider]
    accountID = ...
    clusterID = temp-disconnect-7m6rw
    cluster-default-provider = g2
    region = eu-de
    g2Credentials = /etc/vpc/ibmcloud_api_key
    g2ResourceGroupName = temp-disconnect-7m6rw
    g2VpcName = temp-disconnect-7m6rw-vpc
    g2workerServiceAccountID = ...
    g2VpcSubnetNames = temp-disconnect-7m6rw-subnet-compute-eu-de-1,temp-disconnect-7m6rw-subnet-compute-eu-de-2,temp-disconnect-7m6rw-subnet-compute-eu-de-3,temp-disconnect-7m6rw-subnet-control-plane-eu-de-1,temp-disconnect-7m6rw-subnet-control-plane-eu-de-2,temp-disconnect-7m6rw-subnet-control-plane-eu-de-3
    iamEndpointOverride = https://private.iam.cloud.ibm.com
    g2EndpointOverride = https://eu-de.private.iaas.cloud.ibm.com
    rmEndpointOverride = https://private.resource-controller.cloud.ibm.com
The installer validates user-provided endpoint overrides and injects them into the cluster deployment process, and the Ingress Operator components use the specified endpoints and start up properly.
We have a blocking issue that is fixed and included in IBM TF Provider release 1.60.0. We need this pulled into OCP 4.15.
IBM PowerVS has been notified.
Add support for il-central-1 in AWS
As a user I'm able to deploy OpenShift in il-central-1 in AWS and this region is fully supported
A user can deploy OpenShift in AWS il-central-1 using all the supported installation tools for self-managed customers.
The support of this region is backported to the previous OpenShift EUS release.
The corresponding RHCOS image needs to be available in the new region so the Installer can list the region.
AWS has added support for a new region in their public cloud offering and this region needs to be supported for OpenShift deployments as other regions.
The information about the new region needs to be added to the documentation so this is supported.
MVP aims at refactoring MirrorToDisk and DiskToMirror for OCP releases
As an MVP, this epic covers the work for RFE-3800 (includes RFE-3393 and RFE-3733) for mirroring releases.
The full description / overview of the enclave support is best described here
The design document can be found here
Upcoming epics, such as CFE-942 will complete the RFE work with mirroring operators, additionalImages, etc.
Architecture Overview (diagram)
oc-mirror v1 and v2 can coexist.
By default, the v1 code is called.
oc-mirror switches to v2 when a certain flag is added.
Logs from the local storage registry of oc-mirror need to be redirected to another log file.
As an oc-mirror user, I want the tar generated by the mirror-to-disk process to be as small as possible so that its transfer to the enclaves is as quick as possible.
Background:
After the first demo done by the team, the initial solution that consisted of archiving the whole cache and sending it through the one-way diode to the enclaves was refused by the stakeholders.
CFE-966 studied a solution to include in the tar:
This story is about implementing the studied solution.
Acceptance criteria:
I need an implementation (and interface) that constructs a tar.gz from:
Tar contents:
Diff logic:
I need an implementation (and interface) that can be used at the beginning of the diskToMirror process (!!!! before starting the local cache) and that would extract each part of the tar.gz to its location:
Definitions:
Requirement
AS the admin of several clusters of my company, with several enclaves involved,
WHEN doing MirrorToDisk for several enclaves
I would like to be able to reuse the cache-dir of my main environment (enterprise level) for all enclaves,
AND use a separate working-dir for each enclave
SO THAT I can gain in storage volume and in performance, while preserving a separate context for each enclave
Check how to enable signature verification using the skopeo mod (copy method)
When graph: true is specified for releases in the imageSetConfig, I'd like oc-mirror to create and mirror a graph image for ocp releases
In order to keep the same behavior as v1, we need to have the mirroring at the namespace level
As an OpenShift admin, I want to prevent must-gather from filling up the disk. Because must-gather runs on a master node, a full disk can cause problems on that node and affect the stability of my OCP environment.
Define configurable default limit to emptydir volume in must gather pod
With ovn-ic we have multiple actors (zones) setting status on some CRs. We need to make sure individual zone statuses are reported and then optionally merged into a single status.
Without that change, zones will overwrite each other's statuses.
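The idea can be sketched as follows: each zone writes only its own keyed entry, and a separate step derives the single merged status. The field names (`zones`, `ready`, `message`) and the merge rule are hypothetical and chosen only for this illustration; they are not the ovn-kubernetes API.

```python
from typing import Dict, Iterable


def set_zone_status(cr_status: Dict, zone: str, ready: bool, message: str = "") -> None:
    """A zone updates only its own entry, so it cannot clobber other zones."""
    cr_status.setdefault("zones", {})[zone] = {"ready": ready, "message": message}


def merged_status(cr_status: Dict, expected_zones: Iterable[str]) -> Dict:
    """Derive one overall status from the per-zone entries."""
    zones = cr_status.get("zones", {})
    failed = sorted(z for z, s in zones.items() if not s["ready"])
    missing = sorted(set(expected_zones) - set(zones))
    if failed:
        return {"ready": False, "message": "failed in zones: " + ", ".join(failed)}
    if missing:
        return {"ready": False, "message": "waiting for zones: " + ", ".join(missing)}
    return {"ready": True, "message": ""}


status: Dict = {}
set_zone_status(status, "zone-a", True)
set_zone_status(status, "zone-b", False, "sync error")
overall = merged_status(status, ["zone-a", "zone-b", "zone-c"])
```

With a single shared status field instead, the write from zone-b would simply replace whatever zone-a had reported.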
The MCO merges MachineConfigs in alphanumerical order. Because all custom pools are also workers, this effectively means that all "worker" configs take precedence over configs for custom pools whose names come earlier in alphabetical order. This runs counter to most expectations, since the reason for creating a custom pool is to be different from the worker pool.
Custom pool configs will "win" over worker configs. This is in line with what most customers expect.
[original description related to a specific scenario that this more general feature will help facilitate. It's been moved to comments]
As an OpenShift admin with custom pools,
I would like my custom pool configuration, especially those generated by the MCO (kubelet, containerruntime, node config, etc.) to take priority over base worker pool configs
So I can have custom configs for custom pool take effect.
This is a behaviour change and should also be noted as such in docs/release notes.
Follow up cards were moved to https://issues.redhat.com/browse/MCO-773
Due to the alphanumeric ordering of the MCs, MCC-generated configs with pool names get ordered such that worker configuration generally takes priority over custom pools, unless the custom pool name starts with x, y, or z.
This is mostly a problem for kubeletconfigs and containerruntimeconfigs. If the user wants to create a kubeletconfig for base workers and then a special one for the custom pool, they are unable to easily do so.
We should assume the user wants custom configs to take priority since all base worker configuration is otherwise inherited. Thus our MC merging logic should be updated to handle this.
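The ordering problem can be reproduced with a small sketch. The config names, the `maxPods` values, and the "custom pools sort last" key below are illustrative assumptions; they are not the MCO's actual merge implementation.

```python
from typing import Dict, Iterable, List


def merge_in_order(configs: Dict[str, Dict[str, str]], order: List[str]) -> Dict[str, str]:
    """Later configs in the order overwrite earlier ones."""
    merged: Dict[str, str] = {}
    for name in order:
        merged.update(configs[name])
    return merged


def alphanumeric_order(configs: Dict[str, Dict[str, str]]) -> List[str]:
    """Plain alphanumeric ordering of config names."""
    return sorted(configs)


def custom_pool_last_order(configs: Dict[str, Dict[str, str]],
                           custom_pools: Iterable[str]) -> List[str]:
    """Sort worker configs first and custom pool configs after them."""
    pools = list(custom_pools)

    def is_custom(name: str) -> bool:
        return any(f"-{pool}-" in name for pool in pools)

    return sorted(configs, key=lambda name: (is_custom(name), name))


# Hypothetical generated configs for the worker pool and a custom "infra" pool.
configs = {
    "99-infra-generated-kubelet": {"maxPods": "500"},
    "99-worker-generated-kubelet": {"maxPods": "250"},
}

old_result = merge_in_order(configs, alphanumeric_order(configs))
new_result = merge_in_order(configs, custom_pool_last_order(configs, ["infra"]))
```

Under plain alphanumeric ordering "infra" sorts before "worker", so the worker value wins; with custom pools ordered last, the infra value wins.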
OLM users can stay on the supported path for an installed operator package by
We have customers (e.g., Citigroup) who have dozens of operators installed in hundreds of clusters. They need to conduct audits periodically to tell if all the installed operators are still within the support boundary. Currently, there’s no way for them to tell if a particular operator package, or an update channel being subscribed to, or a current running operator version is supported or not.
Extending OLM's support for content deprecation management will enable operator package maintainers to curate this type of information better, which benefits the admin users of our product.
As per #1146
GRPC API (pkg/api/*; pkg/registry/types.go; utests)
Upstream Github issue: https://github.com/operator-framework/operator-registry/issues/1154
As a cluster-admin I want to get accurate information about the status of operators. The cluster should not report that portions of the cluster are Degraded or Unavailable when they're actually not. I need to see fewer false positives in the CVO status messages.
Background:
This Feature is a continuation of https://issues.redhat.com/browse/OCPSTRAT-180.
Customers are asking for improvements to the upgrade experience (both over-the-air and disconnected). This is a feature tracking epics required to get that work done.
These are alarming conditions which may frighten customers, and we don't want to see them in our own, controlled, repeatable update CI. This example job had logs like:
Feb 18 21:11:25.799 E clusteroperator/openshift-apiserver changed Degraded to True: APIServerDeployment_UnavailablePod: APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
And the job failed, but none of the failures were "something made openshift-apiserver mad enough to go Degraded".
Cluster administrators need an in-product experience to discover and install new Red Hat offerings that can add high value to developer workflows.
Requirements | Notes | IS MVP |
---|---|---|
Discover new offerings in Home Dashboard | | Y |
Access details outlining value of offerings | | Y |
Access step-by-step guide to install offering | | N |
Allow developers to easily find and use newly installed offerings | | Y |
Support air-gapped clusters | | Y |
Discovering solutions that are not available for installation on cluster
No known dependencies
Background, and strategic fit
None
Quick Starts
Cluster admins need to be guided to install RHDH on the cluster.
Enable admins to discover RHDH, be guided to installing it on the cluster, and verifying its configuration.
RHDH is a key multi-cluster offering for developers. This will enable customers to self-discover and install RHDH.
RHDH operator
As a cluster admin, I want to see and learn how to install Janus IDP / Red Hat Developer Hub (RHDH)
Provide better insights into performance and frequency of OpenShift pipeline runs
Show historical and real-time pipeline run data in a unified UI panel, with drill down capabilities.
Provide a visual dashboard that is competitive with leading pipeline solutions
Provide access to logs of running and historical pipeline runs
Enable in-context links to manage pipeline definitions.
Apply RBAC policies to data access.
Provide a Prow dashboard similar to this: https://prow.k8s.io/
Additionally, the goal is that this work will be the beginning of the dynamic plugin for OpenShift Pipelines, which would be installed by the OpenShift Pipelines operator.
Add a new option to the "internet proxy" to allow insecure communication that ignores CORS with the Tekton Results API.
This must not be used in the final implementation! It's a workaround to unblock UI development.
As a user, I want to see data from the k8s API and Tekton results API in the same list page
As a user, I want to see the TaskRuns from the Tekton Results data source.
Doc to install the Results on cluster https://docs.openshift-pipelines.org/operator/install-result.html
Tekton Results API swagger https://petstore.swagger.io/?url=https://raw.githubusercontent.com/avinal/tektoncd-results/openapi-fixes/docs/api/openapi.yaml
As a user, I want to see the info on the details page from where PipelineRun is loaded
NOTE: Events are not available in the Tekton Results API. This is only available starting with OSP 1.12+
Doc to install the Results on cluster https://docs.openshift-pipelines.org/operator/install-result.html
Tekton Results API swagger https://petstore.swagger.io/?url=https://raw.githubusercontent.com/avinal/tektoncd-results/openapi-fixes/docs/api/openapi.yaml
If you delete a PipelineRun, it is deleted from the k8s cluster but its data remains available in the tekton-results database, so the pipeline run list view still shows the deleted PipelineRuns.
The OpenShift Pipelines operator should be installed
tektonResults should be installed and working in the cluster (follow this https://gist.github.com/vikram-raj/257d672a38eb2159b0368eaed8f8970a)
The list endlessly shows the pipeline run and mentions `Resource is being deleted` in the kebab menu
Handle it gracefully, for example by adding a hint that the Delete action only deletes the run from etcd storage, and by disabling the delete action for results-based PipelineRuns
Always
Slack thread: https://redhat-internal.slack.com/archives/CHG0KRB7G/p1700140684929449
As a user, I want to see the info on the details page from where TaskRun is loaded
NOTE: Events are not available in the Tekton Results API. This is only available starting with OSP 1.12+
Doc to install the Results on cluster https://docs.openshift-pipelines.org/operator/install-result.html
Tekton Results API swagger https://petstore.swagger.io/?url=https://raw.githubusercontent.com/avinal/tektoncd-results/openapi-fixes/docs/api/openapi.yaml
As a user, I want to see the PipelineRuns from the Tekton Results data source.
Doc to install the Results on cluster https://docs.openshift-pipelines.org/operator/install-result.html
Tekton Results API swagger https://petstore.swagger.io/?url=https://raw.githubusercontent.com/avinal/tektoncd-results/openapi-fixes/docs/api/openapi.yaml
The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike.
Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.
In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.
For this epic, "state" means "what is the MCO doing?" – so the goal here is to try to make sure that it's always known what the MCO is doing.
This includes:
While this probably crosses a little bit into the "status" portion of certain MCO objects, as some state is definitely recorded there, this probably shouldn't turn into a "better status reporting" epic. I'm interpreting "status" to mean "how is it going" so status is maybe a "detail attached to a state".
Exploration here: https://docs.google.com/document/d/1j6Qea98aVP12kzmPbR_3Y-3-meJQBf0_K6HxZOkzbNk/edit?usp=sharing
https://docs.google.com/document/d/17qYml7CETIaDmcEO-6OGQGNO0d7HtfyU7W4OMA6kTeM/edit?usp=sharing
Implementing and merging the new Upgrade Monitoring mechanism are to be separated, as I am working on the functionality while also wrestling with review, API migrations, and the featuregate.
This card is meant to track merging the various components of the machineconfignode in the MCO.
Ensure that the pod exists but the functionality behind the pod is not exposed by default in the release version this work ships in.
This can be done by creating a new featuregate in openshift/api, vendoring that into the cluster config operator, and then checking for this featuregate in the state controller code of the MCO.
Consolidated Enhancement of HyperShift/KubeVirt Provider Post GA
This feature aims to provide a comprehensive enhancement to the HyperShift/KubeVirt provider integration post its GA release.
By consolidating CSI plugin improvements, core improvements, and networking enhancements, we aim to offer a more robust, efficient, and user-friendly experience.
Post GA quality of life improvements for the HyperShift/KubeVirt core
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
In PR https://github.com/openshift/hypershift/pull/2576/ we had to disable the nodepool upgrade test. This is because there are no previous releases which have the new kubevirt rhcos variant available... so there's no release to upgrade from
We need to re-enable this test once we have a stable previous release in CI to test against (post 4.14 feature freeze and after 4.15 is branched).
Currently there is no option to influence the placement of the VMs of a hosted cluster with the kubevirt provider. The existing NodeSelector in HostedCluster influences only the pods in the hosted control plane namespace.
The goal is to introduce a new node selector field in the .spec.platform.kubevirt stanza of NodePool, propagate it to the VirtualMachineSpecTemplate, and expose it in the hypershift and hcp CLIs.
kubevirt node pools currently only set requests for cpu/mem. This doesn't guarantee that the kubevirt VMs will have access to dedicated resources, which is something some customers may desire.
To resolve this, we should create a toggle on the nodepool under the kubevirt platform section to enable dedicated resources, which will give each VM guaranteed dedicated access to cpus and memory.
We need to make sure to document that multiqueue should only be used with MTU >= 9000 on the infra cluster. Smaller MTU sizes (like 1500 for example) actually displayed degraded results compared to not having multiqueue enabled at all.
CNV QE, field engineers, and developers often need to test hypershift kubevirt in a way that isn't officially supported yet, and this often involves needing to modify the kubevirt VM's spec to enable some sort of feature, add an interface/volume, or something else along those lines.
We need to design a mechanism that works as an escape hatch to allow these sorts of unsupported modifications to be experimented with easily. This mechanism should not be a part of the official Hypershift APIs, but instead something that people can influence via an annotation or similar means.
It's likely this feature will serve as a way for us to grant temporary support exceptions to customers as well.
This can be achieved using an annotation with a json patch in it. Below is an example of how such a json patch might be placed on a NodePool to influence the VMs generated by the NodePool to have a secondary interface.
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  annotations:
    hypershift.openshift.io/kubevirt-vm-jsonpatch: |-
      [
        {
          "op": "add",
          "path": "/spec/template/spec/networks",
          "value": {"name": "secondary", "multus": {"networkName": "mynetwork"}}
        },
        {
          "op": "add",
          "path": "/spec/template/spec/domain/devices/interfaces",
          "value": {"name": "secondary", "bridge": {}}
        }
      ]
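To make the semantics of a JSON patch "add" operation concrete, here is a minimal sketch of that single operation following RFC 6902 and RFC 6901. It is a hand-written illustration, not the code HyperShift uses to apply the annotation, and the VM document is a simplified stand-in. Note that "add" on an existing object member replaces its value, while a path ending in `/-` appends to an array.

```python
import copy
from typing import Any, List


def _tokens(pointer: str) -> List[str]:
    """Split a JSON Pointer (RFC 6901) into unescaped reference tokens."""
    if pointer == "":
        return []
    if not pointer.startswith("/"):
        raise ValueError("JSON Pointer must be empty or start with '/'")
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def apply_add(document: Any, path: str, value: Any) -> Any:
    """Return a copy of document with one JSON Patch 'add' operation applied."""
    tokens = _tokens(path)
    if not tokens:
        return copy.deepcopy(value)  # adding at the root replaces the whole document
    doc = copy.deepcopy(document)
    parent = doc
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    last = tokens[-1]
    if isinstance(parent, list):
        index = len(parent) if last == "-" else int(last)
        if index > len(parent):
            raise IndexError("array index out of range for 'add'")
        parent.insert(index, value)
    else:
        parent[last] = value  # creates the member, or replaces an existing one
    return doc


# Simplified stand-in for a generated VM; not a complete KubeVirt object.
vm = {"spec": {"template": {"spec": {"networks": [{"name": "default", "pod": {}}]}}}}
secondary = {"name": "secondary", "multus": {"networkName": "mynetwork"}}

appended = apply_add(vm, "/spec/template/spec/networks/-", secondary)
replaced = apply_add(vm, "/spec/template/spec/networks", secondary)
```

In `appended` the default network is kept and the secondary one is added after it; in `replaced` the whole `networks` member becomes the single object.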
The goal of this epic is to provide a solution for tying HyperShift/KubeVirt vm worker nodes into networks outside of the default pod network.
An example scenario for this Epic is a user who wishes to run their KubeVirt worker node VMs on a network they have configured within their datacenter. The user already has IPAM on their network (likely through DHCP) and wishes the KubeVirt VMs for their HCP to be tied to this externally provisioned network rather than the default pod network provided by OVNKubernetes.
What is required of us in this scenario is to provide a way to configure usage of this user-provided network on the NodePool, and to ensure that the capi ecosystem components (capk, cloud-provider-kubevirt) work as expected with this VM configuration.
Role | Contact |
---|---|
PM | Peter Lauterbach |
Documentation Owner | TBD |
Delivery Owner | (See assignee) |
Quality Engineer | (See QE Assignee) |
Who | What | Reference |
---|---|---|
DEV | Upstream code and tests merged | https://github.com/openshift/hypershift/pull/3066 |
DEV | Upstream documentation merged | https://github.com/openshift/hypershift/pull/3464 |
DEV | gap doc updated | N/A |
DEV | Upgrade consideration | None |
DEV | CEE/PX summary presentation | N/A |
QE | Test plans in Polarion | N/A |
QE | Automated tests merged | https://github.com/openshift/hypershift/pull/3449 |
DOC | Downstream documentation merged | https://github.com/openshift/hypershift/pull/3464 |
We need the ability to configure a KubeVirt platform NodePool to use a custom network interface (not the default pod network) when creating the VMs.
Since cloud-provider-kubevirt will not be able to mirror LBs when a NodePool is not on a OVNKubernetes defined network, we'll need to make sure cloud-provider-kubevirt's LB mirroring behavior is disabled when custom networks are in use.
The scenarios that we need to cover are the ones extracted from the notes doc https://docs.google.com/document/d/1zzyHxUEPyEM4hgRh_jww4gRIKJhRTxYeKeSYhxw-pDc/edit
Scenarios
Description of problem:
The hypershift kubevirt provider is missing the OpenShift mechanism to select which interface/IP address kubelet is going to use to register. The nodeip-configuration.service should be activated by the MCO for the kubevirt platform.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Always
Steps to Reproduce:
Depends on hypershift kubevirt multinet feature https://github.com/openshift/hypershift/pull/3066
1. Create an openshift libvirt/baremetal cluster with metallb, cnv, odf, local-storage and kubernetes-nmstate with a pair of extra nics at nodes
2. Populate the following network attachment definition and nncps to connect those extra nics
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: net1
  annotations:
    k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/net1
spec:
  config: >
    {
      "cniVersion": "0.3.1",
      "name": "net1",
      "plugins": [{
        "type": "cnv-bridge",
        "bridge": "net1",
        "ipam": {}
      }]
    }
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: net2
  annotations:
    k8s.v1.cni.cncf.io/resourceName: bridge.network.kubevirt.io/net2
spec:
  config: >
    {
      "cniVersion": "0.3.1",
      "name": "net2",
      "plugins": [{
        "type": "cnv-bridge",
        "bridge": "net2",
        "ipam": {}
      }]
    }
---
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: multi-net
spec:
  desiredState:
    interfaces:
    - name: net1
      type: linux-bridge
      state: up
      ipv4:
        enabled: false
      ipv6:
        enabled: false
      bridge:
        options:
          stp:
            enabled: true
        port:
        - name: ens4
    - name: net2
      type: linux-bridge
      state: up
      ipv4:
        enabled: false
      ipv6:
        enabled: false
      bridge:
        options:
          stp:
            enabled: true
        port:
        - name: ens5
3. Create a kubevirt hosted cluster using those nics with the following command
--additional-network=name:default/net1 --additional-network=name:default/net2 --attach-default-network=false
Actual results:
kubelet ends up exposing the IP from net2, but ovn-k uses net1
Expected results:
kubelet and ovn-k should use net1
Additional info:
4.15 will only support secondary networks together with the default network. The productized cli needs the `attach-default-network` cli option removed.
This cli option needs to be removed from the main, 4.16, and 4.15 branches. We'll add it back once standalone secondary networks are supported.
Note, this is not a change to the API, only the CLI.
The goal of this epic is to introduce a validating webhook for the KubeVirt platform that executes the HostedCluster and NodePool validation at admission time.
One important note here is that the core Hypershift team has a requirement that all validation logic must be within the controller loop. This does not exclude the usage of a validation webhook, it merely means that if we introduce a validating webhook that it cannot replace the controller validation.
That means this task will involve abstracting our validation logic in a way that both the controller and our validating webhook share the same logic.
Who | What | Reference |
---|---|---|
DEV | Upstream roadmap issue (or individual upstream PRs) | <link to GitHub Issue> |
DEV | Upstream documentation merged | <link to meaningful PR> |
DEV | gap doc updated | <name sheet and cell> |
DEV | Upgrade consideration | <link to upgrade-related test or design doc> |
DEV | CEE/PX summary presentation | label epic with cee-training and add a <link to your support-facing preso> |
QE | Test plans in Polarion | <link or reference to Polarion> |
QE | Automated tests merged | <link or reference to automated tests> |
DOC | Downstream documentation merged | <link to meaningful PR> |
The underlying infra cluster hosting HCP KubeVirt worker VMs must meet some versioning requirements (cnv >= 4.14, ocp >= 4.14). There is a validation check that enforces this on the backend today. We'd like to move this validation to a webhook so users get early feedback if the validation will fail during creation.
The hypershift operator only supports specific versions of release payloads. We'd like to give users early feedback by validating that the release payload they have picked falls within the backend operator's supported window.
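A minimal sketch of the version gate described above, assuming the versions arrive as plain strings such as `4.14.3`. The function names and the input dictionary shape are illustrative assumptions; only the two minimums (cnv >= 4.14, ocp >= 4.14) come from the text.

```python
# Minimum (major, minor) per component, taken from the requirement above.
MINIMUM = {"cnv": (4, 14), "ocp": (4, 14)}


def parse_minor(version: str) -> tuple:
    """Return (major, minor) from strings such as '4.14.3' or '4.15.0-rc.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def check_infra_versions(found: dict) -> list:
    """Compare discovered versions against MINIMUM; return human-readable problems."""
    problems = []
    for component, minimum in MINIMUM.items():
        version = found.get(component)
        if version is None:
            problems.append(f"{component} version is unknown")
        elif parse_minor(version) < minimum:
            problems.append(
                f"{component} {version} is older than required {minimum[0]}.{minimum[1]}"
            )
    return problems
```

An empty list means the infra cluster qualifies; a webhook could turn a non-empty list into an admission rejection message.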
The backend controller loop performs this check here. We'd like the same check to be introduced into an optional validating webhook.
Currently, Hypershift's HostedCluster and NodePool APIs are difficult to use directly. The "hcp" cli alleviates this complexity to some degree, but comes at the cost of requiring usage of a cli tool rather than creating the resources directly.
The goal of this epic is to reduce the complexity of the HostedCluster and NodePool APIs to the point that users only need to specify a small set of values in these APIs initially at create time, then during admission have a mutating webhook fill in the remaining details using the defaults that the "hcp" cli currently uses.
Essentially, the goal here is to move the "magic" defaulting that is so convenient to users out of the "hcp" tool and to the hypershift operator backend using a mutation webhook.
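To illustrate what admission-time defaulting could look like, here is a small sketch that fills in unset fields while leaving user-supplied values untouched. The keys and values in `DEFAULTS` are hypothetical placeholders, not the actual defaults used by the "hcp" cli.

```python
import copy

# Hypothetical defaults; the real values are whatever the "hcp" cli uses today.
DEFAULTS = {
    "networking": {"networkType": "OVNKubernetes"},
    "etcd": {"managementType": "Managed"},
}


def apply_defaults(obj: dict, defaults: dict = DEFAULTS) -> dict:
    """Return a copy of obj with unset fields filled in; user values always win."""
    result = copy.deepcopy(obj)
    for key, value in defaults.items():
        if key not in result:
            # Field absent: take the default wholesale.
            result[key] = copy.deepcopy(value)
        elif isinstance(value, dict) and isinstance(result[key], dict):
            # Field present and nested: default only the missing sub-fields.
            result[key] = apply_defaults(result[key], value)
    return result
```

The function never mutates its input, which keeps it safe to call from both a webhook and a controller on the same object.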
All HC and NP API defaulting within the hcp cli that impacts the KubeVirt platform should be moved to CRD defaulting and mutating webhooks
Today the hcp cli tool is rendering an etcd encryption secret for each cluster that is created. We'd like for the backend to perform this logic so users can more easily use the HostedCluster API directly without needing the cli tool.
In a multi-tenant environment, both ICSP and IDMS objects should be functional in the cluster at the same time.
Enable both ICSP and IDMS objects to exist on the cluster at the same time and roll out both configurations.
ICSP and IDMS objects should be functional
Provide the ICSP to IDMS migration path without node reboot which can lead to disruption.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Do not error out if both ICSP and IDMS resources exist.
MCO watches both ICSP and IDMS objects. As an OpenShift developer, I want it to process content from both kinds of CRD into the underlying configuration.
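One way to picture "process content from both kinds" is a merge keyed by source repository. The sketch below assumes ICSP objects carry their entries under `spec.repositoryDigestMirrors` and IDMS objects under `spec.imageDigestMirrors`, each entry having a `source` and a list of `mirrors`; the function name and the merged output shape are illustrative and do not represent how the MCO actually renders its configuration.

```python
def merge_digest_mirrors(icsp_objects: list, idms_objects: list) -> dict:
    """Combine mirror entries from both CRD kinds into {source: [mirrors...]}."""
    entries = []
    for obj in icsp_objects:
        entries += obj.get("spec", {}).get("repositoryDigestMirrors", [])
    for obj in idms_objects:
        entries += obj.get("spec", {}).get("imageDigestMirrors", [])

    merged = {}
    for entry in entries:
        mirrors = merged.setdefault(entry["source"], [])
        for mirror in entry.get("mirrors", []):
            if mirror not in mirrors:  # keep first-seen order, drop duplicates
                mirrors.append(mirror)
    return merged
```

With this shape, the presence of both kinds is not an error: overlapping sources simply contribute to one deduplicated mirror list.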
More details at ARO managed identity scope and impact.
This Section: A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
Epic to capture the items not blocking for OCPSTRAT-506 (OCPBU-8)
Evaluate if any of the ARO predefined roles in the credentials request manifests of OpenShift cluster operators give elevated permissions. Remove any such predefined role from spec.predefinedRoles field and replace it with required permissions in the new spec.permissions field.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
Remove use of Terraform in the IPI Installer from the top providers: AWS, vSphere, Metal, and Azure.
The IPI Installer no longer contains or uses Terraform.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
Two major parts:
The installer should support feature gate validation so that new providers can be enabled via featuregates.
Create an installer image to be promoted to the release payload that contains the openshift altinfra binary (produced with build tags).
I want to be able to produce an openshift-install binary for installing AWS (other platforms will not be supported) which is free of Terraform.
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Encapsulate Terraform to its own package (removing dependencies from pkg/asset and cmd) so that Terraform can be included or removed based on build tags.
Interface provides a way of substituting other infrastructure providers.
oc, the openshift CLI, needs as close to feature parity as we can get without the built-in oauth server and its associated user and group management. This will enable scripts, documentation, blog posts, and knowledge base articles to function across all form factors and the same form factor with different configurations.
CLI users and scripts should be usable in a consistent way regardless of the token issuer configuration.
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
oc login needs to work without the embedded oauth server
Why is this important? (mandatory)
We are removing the embedded oauth-server and we utilize a special oauthclient in order to make our login flows functional
This allows documentation, scripts, etc to be functional and consistent with the last 10 years of our product.
This may require vendoring entire CLI plugins. It may require new kubeconfig shapes.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Manual cherry-pick work to backport the changes merged in 4.16
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
`oc whoami` must work without the oauth-apiserver. There is an endpoint recently added to kube that provides equivalent functionality.
Why is this important? (mandatory)
The oauth-apiserver does not control IdP information when external OIDC is used. This means the oauth-apiserver is no longer deployed, which causes `oc whoami` to fail.
As part of the deprecation progression of the openshift-sdn CNI plug-in, remove it as an install-time option for new 4.15+ release clusters.
The openshift-sdn CNI plug-in is sunsetting according to the following progression:
All development effort is directed to the default primary CNI plug-in, ovn-kubernetes, which has feature parity with the older openshift-sdn CNI plug-in that has been feature frozen for the entire 4.x timeframe. In order to best serve our customers now and in the future, we are reducing our support footprint to the dominant plug-in, only.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Enrich the OpenShift Pipelines experience for DevSecOps and Software Supply Chain Security use cases such as CVEs, SBOMs and signatures.
Improving application developer experience when using OpenShift Pipelines by increasing awareness of important SSCS elements. An OpenShift Pipelines PipelineRun's Task can emit CVEs, SBOMs, policy reporting as well as identify signing status.
Miro link - here
Project GUI enhancements doc
As a user, I want to see the SBOM link in the pipelinerun details page and if the pipelinerun is signed by chains then a signed badge should appear next to the pipelinerun name.
Tekton results annotation to be used - https://docs.google.com/document/d/1_1YXFx0ymzjl4b9M_LDjmmGrEYey5mDrfTdNn_56hpM/edit#heading=h.u0j4yw1zdczm
As a user, I want to see the vulnerabilities in the OCP console, so that I can identify and fix the issue as early as possible.
Tekton results naming conventions - doc
Batch the tekton results API request to avoid performance issues and use pagination to fetch the vulnerabilities when a user scrolls down in the list page.
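The pagination idea can be sketched as a lazy generator that requests the next page only when the consumer (for example, a list that the user scrolls) needs more rows. The `fetch_page` callable, its `(token, size)` signature, and the token format are assumptions made for illustration; they are not the Tekton Results API. `in_memory_fetcher` is a stand-in used only to exercise the generator.

```python
def iter_results(fetch_page, page_size=50):
    """Yield records one by one, fetching the next page only on demand.

    fetch_page(token, size) is assumed to return (records, next_token),
    where next_token is None (or empty) on the last page.
    """
    token = None
    while True:
        records, token = fetch_page(token, page_size)
        yield from records
        if not token:
            break


def in_memory_fetcher(records, calls):
    """Stand-in for a real API call; logs each page start offset in `calls`."""
    def fetch(token, size):
        start = int(token or 0)
        calls.append(start)
        end = start + size
        next_token = str(end) if end < len(records) else None
        return records[start:end], next_token
    return fetch
```

Because the generator is lazy, reading the first two items with a page size of two triggers exactly one fetch; further pages are requested only as iteration continues.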
Note: A pipelinerun can have multiple SCAN_OUTPUT results.
As a user, I would like to see all the pipelinerun results in a new Output tab.
Slack thread - https://redhat-internal.slack.com/archives/C060FCC5KU1/p1699442229040389?thread_ts=1699441759.578729&cid=C060FCC5KU1
Address technical debt around self-managed HCP deployments, including but not limited to
The CLI cannot create dual stack clusters with the default values. We need to create the proper flags to enable the HostedCluster to be a dual stack one using the default values
This can be based on the existing CAPI agent provider workflow which already has an env var flag for disconnected
By default, the Agent provider creates clusters by delegating to the CLI. That is fine in itself, but if the UpgradeType is not passed as a CLI argument it defaults to Replace, which is aimed at cloud providers. We need to default the UpgradeType for the Agent provider to InPlace while still respecting any value set from the CLI. We also need to check with the KubeVirt team what their desired default is.
A customer has escalated the following issues where ports don't have TLS support. This feature request lists all the component ports raised by the customer.
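The intended precedence can be written down in a few lines. This is a sketch of the rule as described, with a hypothetical function name; the platform string `Agent` and the values `InPlace` and `Replace` come from the text, and the KubeVirt default is left at the current behavior because it is still an open question.

```python
from typing import Optional


def resolve_upgrade_type(platform: str, cli_value: Optional[str] = None) -> str:
    """Pick the NodePool upgrade type: explicit CLI value first, then platform default."""
    if cli_value:
        return cli_value  # always respect what the user passed
    if platform == "Agent":
        return "InPlace"  # proposed default for the Agent provider
    return "Replace"  # existing default, aimed at cloud providers
```

For example, `resolve_upgrade_type("Agent")` yields `InPlace`, while `resolve_upgrade_type("Agent", "Replace")` keeps the user's explicit choice.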
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we are serving the metrics over HTTP on port 9258; we need to upgrade this to use TLS.
To solve this, use Kube RBAC proxy as a side container to proxy the metrics and provide an authentication/authorization layer for the metrics
Related to https://issues.redhat.com/browse/RFE-4665
CCMO metrics are currently exposed on a non-TLS server.
We should only expose the metrics via a TLS server.
Use Kube RBAC Proxy (as inspired by other components, eg MAO) to expose metrics via TLS, keeping non-TLS connections only on the localhost.
Currently, we are serving the metrics as http on 9191, and via TLS on 9192.
We need to make sure the metrics are only available on 9192 via TLS.
Related to https://issues.redhat.com/browse/RFE-4665
CMA currently exposes metrics on two ports via the 0.0.0.0 all hosts binding. We need to make sure that only the TLS port is accessible from outside localhost.
Epic Goal*
There was an epic / enhancement to create a cluster-wide TLS config that applies to all OpenShift components:
https://issues.redhat.com/browse/OCPPLAN-4379
https://github.com/openshift/enhancements/blob/master/enhancements/kube-apiserver/tls-config.md
For example, this is how KCM sets --tls-cipher-suites and --tls-min-version based on the observed config:
https://issues.redhat.com/browse/WRKLDS-252
https://github.com/openshift/cluster-kube-controller-manager-operator/pull/506/files
The cluster admin can change the config based on their risk profile, but if they don't change anything, there is a reasonable default.
We should update all CSI driver operators to use this config. Right now we have a hard-coded cipher list in library-go. See OCPBUGS-2083 and OCPBUGS-4347 for background context.
Why is this important? (mandatory)
This will keep the cipher list consistent across many OpenShift components. If the default list is changed, we get that change "for free".
It will reduce support calls from customers and backport requests when the recommended defaults change.
It will provide flexibility to the customer, since they can set their own TLS profile settings without requiring code change for each component.
Scenarios (mandatory)
As a cluster admin, I want to use TLSSecurityProfile to control the cipher list and minimum TLS version for all CSI driver operator sidecars, so that I can adjust the settings based on my own risk assessment.
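As a rough sketch of the scenario, an operator could translate the observed TLS settings into container arguments instead of relying on a hard-coded cipher list. The flag names match the KCM example mentioned earlier; whether each CSI sidecar accepts the same flags, and the `minTLSVersion` / `cipherSuites` key names of the observed config, are assumptions here.

```python
def tls_flags(observed: dict) -> list:
    """Render sidecar arguments from an observed TLS config (assumed key names)."""
    flags = []
    min_version = observed.get("minTLSVersion")
    if min_version:
        flags.append(f"--tls-min-version={min_version}")
    ciphers = observed.get("cipherSuites") or []
    if ciphers:
        flags.append("--tls-cipher-suites=" + ",".join(ciphers))
    return flags
```

Returning an empty list when nothing is observed lets the sidecar fall back to its own built-in defaults rather than receiving empty flag values.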
Dependencies (internal and external) (mandatory)
None, the changes we depend on were already implemented.
Goal:
As an administrator, I would like to use my own managed DNS solution instead of only specific openshift-install supported DNS services (such as AWS Route53, Google Cloud DNS, etc...) for my OpenShift deployment.
Problem:
While cloud-based DNS services provide convenient hostname management, there's a number of regulatory (ITAR) and operational constraints customers face prohibiting the use of those DNS hosting services on public cloud providers.
Why is this important:
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL):
Previous Work:
Open questions:
Link to Epic: https://docs.google.com/document/d/1OBrfC4x81PHhpPrC5SEjixzg4eBnnxCZDr-5h3yF2QI/edit?usp=sharing
Append the Infra CR with only the GCP PlatformStatus field set (without any other fields, especially the Spec), populated with the LB IPs, at the end of the bootstrap ignition. The theory is that when the Infra CR is applied from the bootstrap ignition, the infra manifest is applied first. As we progress through all the other assets in the ignition files, the Infra CR appears again but with only the LB IPs set. That way it updates the existing Infra CR already applied to the cluster.
Use the API, API-Int and Ingress LB IPs within GCPPlatformStatus instead of the `lb-config` ConfigMap to generate the in cluster CoreDNS pods.
This Epic details the work required to augment this CoreDNS pod to also resolve the *.apps URL. In addition, it will include changes to prevent the Ingress Operator from configuring the cloud DNS after the ingress LBs have been created.
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
If a zone has not been granted permission to be shared across projects (if in different projects), then the install will fail.
Support ARO and customers aspiring to follow Microsoft Azure security recommendations by allowing the Azure storage account hosting the object storage bucket for the integrated registry to be configured as "private" vs. the default public.
OpenShift installations on Azure can be configured so that they no longer trigger Azure's Security Advisor with regard to the use of a public endpoint for the Azure storage account created by the integrated registry operator.
Several users noticed warnings in Azure Security advisor reporting the potentially dangerous exposure of the storage endpoint used by the integrated registry configured by its operator. There is no real security threat here because despite the endpoint being public, access to it is strictly locked down to a single set of credentials used by the internal registry only.
Still, customers need to be able to deploy clusters that, out of the box, do not violate Microsoft security recommendations.
This feature sets the foundation for OCPSTRAT-997 to be delivered.
Customers updating to the version of OpenShift that delivers this feature shall not have their integrated registry configuration updated automatically.
We require documentation in the section for the integrated registry operator that describes how to manually configure the vnet and subnet that shall be used for the private storage endpoint in case the customer wants to leverage a network resource group different from the cluster's.
We also require documentation that describes the single tunable for the integrated registry operator that is required to be set to "internal" to automate the detection of existing vnet and subnets in the network resource group of the cluster as opposed to manual specification of a user-defined vnet/subnet pair.
Story: As an image registry developer, I want to be able to programmatically create a private endpoint in Azure without having to worry about explicitly creating the supporting objects, so that I can easily enable support for private storage accounts in Azure.
ACCEPTANCE CRITERIA
When the registry is in private mode:
DOCUMENTATION
The installer currently does not tag subnets and vnet pre-created by users (and by itself?) on Azure.
Background
When "Publish: Internal", the image registry operator needs to discover subnets and vnet the cluster was provisioned with, so that it can provision a private storage account for the registry to use.
Story: As a user, I want to be able to configure the registry operator to use Azure Private Endpoints so that I can deploy the registry on Azure without a public facing endpoint.
ACCEPTANCE CRITERIA
DOCUMENTATION
Reduce the OpenShift platform and associated RH provided components to a single physical core on Intel Sapphire Rapids platform for vDU deployments on SingleNode OpenShift.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
Provide a mechanism to tune the platform to use only one physical core. | Users need to be able to tune different platforms. | YES |
Allow for full zero touch provisioning of a node with the minimal core budget configuration. | Node provisioned with SNO Far Edge provisioning method - i.e. ZTP via RHACM, using DU Profile. | YES |
Platform meets all MVP KPIs | | YES |
Questions to be addressed:
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Goal: Establish OVN as THE SDN for CNV to meet modern virtualization network needs.
This MUST be closely aligned to the OCP OVN effort.
With
Allow administrators to create new Network Attachment Definitions for OVN Kubernetes secondary localnet networks.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: <name>
  namespace: <namespace>
spec:
  config: |2
    {
      "cniVersion": "0.4.0",
      "name": "<bridge mapping>",
      "type": "ovn-k8s-cni-overlay",
      "topology":"localnet",
      "vlanID": <VLAN>, # set only if passed from the user
      "mtu": <MTU>, # set only if passed from the user
      "netAttachDefName": "<namespace>/<name>"
    }
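The "set only if passed from the user" rule for `vlanID` and `mtu` is easy to get wrong when templating by hand, so here is a small sketch that builds the same object programmatically. The function name and parameters are illustrative; the field names and fixed values mirror the manifest above.

```python
import json
from typing import Optional


def localnet_nad(name: str, namespace: str, bridge_mapping: str,
                 vlan_id: Optional[int] = None, mtu: Optional[int] = None) -> dict:
    """Build a localnet NetworkAttachmentDefinition as a plain dictionary."""
    config = {
        "cniVersion": "0.4.0",
        "name": bridge_mapping,
        "type": "ovn-k8s-cni-overlay",
        "topology": "localnet",
        "netAttachDefName": f"{namespace}/{name}",
    }
    # Optional keys are emitted only when the user supplied a value.
    if vlan_id is not None:
        config["vlanID"] = vlan_id
    if mtu is not None:
        config["mtu"] = mtu
    return {
        "apiVersion": "k8s.cni.cncf.io/v1",
        "kind": "NetworkAttachmentDefinition",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"config": json.dumps(config, indent=2)},
    }
```

Serializing `config` with `json.dumps` also guarantees valid JSON, which the hand-written template with inline `#` comments is not.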
Some of the code under the web console's /frontend/public/components/monitoring/ dir is no longer used, so it can be removed.
There may also be some code in the redux actions and reducers that are no longer used that can also be removed.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision PowerVS infrastructure without the use of Terraform.
Generate the machine manifests. Follow how it's done for AWS [0] and the PoC code [1]
[0] https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go
Endpoint overrides will be used by cluster operators for disconnected scenarios.
Implement an initial PowerVS provider for CAPI.
When publishStrategy is Internal, we need to create the api and api-int records against an IBM DNS service instead of CIS.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for OpenStack deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenStack infrastructure without the use of Terraform.
Move CAPO (cluster-api-provider-openstack) to a stable API.
Currently OpenShift on OpenStack is using MAPO. This uses objects from the upstream CAPO project under the hood but not the APIs. We would like to start using CAPO and declare MAPO as deprecated and frozen, but before we do that upstream CAPO's own API needs to be declared stable.
Upstream CAPO's API is currently at v1alpha6. There are a number of incompatible changes already planned for the API which have prevented us from declaring it v1beta1. We should make those changes and move towards a stable API.
The changes need to be accompanied by an improvement in test coverage of API versions.
Upstream issues targeted for v1beta1 should be tracked in the v0.7 milestone: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues?q=is%3Aopen+is%3Aissue+milestone%3Av0.7
Another option is to switch to cluster-capi-operator if it graduates, which would mean only a single API would be maintained.
N/A. This is purely upstream work for now. We will directly benefit from this work once we switch to CAPO in a future release.
Upstream CAPO provides a v1beta1 API
Upstream CAPO includes e2e tests using envtest (https://book.kubebuilder.io/reference/envtest.html) which will allow us to avoid breaks in API compatibility
None.
N/A
To move forward, we need to bump CAPO in MAPO from v1alpha6 to v1alpha7.
This is needed to identify whether masters are schedulable and to upload the RHCOS image to Glance.
We need to get CI into good shape on https://github.com/openshift/installer/pull/7939 so that we can ask for reviews.
Right now, when trying an installation with this work (https://github.com/openshift/installer/pull/7939), the bootstrap machine is not deleted. We need to ensure it is gone once bootstrapping is finished.
Rebase Installer onto the development branch of cluster-api-provider-openstack to provide CI signal to the CAPO maintainers.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision IBM Cloud VPC infrastructure without the use of Terraform.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer - specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision GCP infrastructure without the use of Terraform.
Create the GCP Infrastructure controller in /pkg/clusterapi/system.go.
It will be based on the AWS controller in that file, which was added in https://github.com/openshift/installer/pull/7630.
Description of problem:
The bootstrap machine never gets a public IP address. When the publish strategy is set to External, the bootstrap machine should have a public IP address.
How reproducible:
always
When GCP workers are created, they are not able to pull ignition over the internal subnet because that traffic is not allowed by the firewall rules created by CAPG. The allow-<infraID>-cluster rule allows all TCP traffic with tags for <infraID>-node and <infraID>-control-plane, but the workers that are created have the tag <infraID>-worker.
We need to either add the worker tags to this firewall rule or add node tags to the worker. We should decide on a general use of CAPG firewall rules.
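The tag mismatch described above can be illustrated with a minimal sketch. It assumes the rule selects instances by network tag, and that a tagged rule applies to an instance only when the two share at least one tag. The infra ID and the helper name `rule_applies` are hypothetical, and this is not CAPG or installer code.

```python
def rule_applies(rule_tags, instance_tags):
    """Return True if a tag-scoped firewall rule selects the instance.

    Assumption: a rule with tags applies only to instances that carry
    at least one of those tags.
    """
    return bool(set(rule_tags) & set(instance_tags))


infra_id = "mycluster-abc12"  # hypothetical infra ID
cluster_rule_tags = [f"{infra_id}-node", f"{infra_id}-control-plane"]

control_plane_tags = [f"{infra_id}-control-plane"]
worker_tags = [f"{infra_id}-worker"]

# Control-plane machines are selected by the rule; workers are not.
print(rule_applies(cluster_rule_tags, control_plane_tags))  # True
print(rule_applies(cluster_rule_tags, worker_tags))         # False

# Option 1 from the text: add the worker tag to the rule.
print(rule_applies(cluster_rule_tags + [f"{infra_id}-worker"], worker_tags))  # True

# Option 2 from the text: add the node tag to the workers.
print(rule_applies(cluster_rule_tags, worker_tags + [f"{infra_id}-node"]))    # True
```

Either option makes the rule select the workers; which one to use depends on the general decision about CAPG firewall rules.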
When installing on GCP, I want control-plane (including bootstrap) machines to bootstrap using ignition.
I want bootstrap ignition to be secured so that sensitive data is not publicly available.
Description of criteria:
Destroying bootstrap ignition can be handled separately.
I want to create the public and private DNS records using one of the CAPI interface SDK hooks.
When a GCP cluster is created using CAPI, upon destroy the addresses associated with the apiserver LoadBalancer are not removed. For example, here are addresses left over after previous installations:
$ gcloud compute addresses list --uri | grep bfournie
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-gn6g7-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-h96j2-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-k7fdj-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nh4z5-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-nls2h-apiserver
https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-qrhmr-apiserver
Here is one of the addresses:
$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
address: 34.107.255.76
addressType: EXTERNAL
creationTimestamp: '2024-04-15T15:17:56.626-07:00'
description: ''
id: '2697572183218067835'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-27kzq-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver
status: RESERVED
[bfournie@bfournie installer-patrick-new]$ gcloud compute addresses describe https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
address: 34.149.208.133
addressType: EXTERNAL
creationTimestamp: '2024-03-27T09:35:00.607-07:00'
description: ''
id: '1650865645042660443'
ipVersion: IPV4
kind: compute#address
labelFingerprint: 42WmSpB8rSM=
name: bfournie-capg-test-6jrwz-apiserver
networkTier: PREMIUM
selfLink: https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver
status: RESERVED
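As a rough aid for checking what a destroy left behind, the sketch below extracts leftover apiserver address names from the URI list shown above. The helper name `leftover_apiserver_addresses` is hypothetical. It only parses text that has already been captured from `gcloud compute addresses list --uri`, and does not call any GCP API.

```python
def leftover_apiserver_addresses(uris, name_prefix):
    """Return address names that start with name_prefix and end in '-apiserver'.

    `uris` is an iterable of address selfLink URIs, one per entry, such as the
    captured output of `gcloud compute addresses list --uri`.
    """
    names = []
    for uri in uris:
        uri = uri.strip()
        if not uri:
            continue
        # The resource name is the last path segment of the URI.
        name = uri.rstrip("/").rsplit("/", 1)[-1]
        if name.startswith(name_prefix) and name.endswith("-apiserver"):
            names.append(name)
    return names


captured = [
    "https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-27kzq-apiserver",
    "https://www.googleapis.com/compute/v1/projects/openshift-dev-installer/global/addresses/bfournie-capg-test-6jrwz-apiserver",
]
print(leftover_apiserver_addresses(captured, "bfournie-capg-test-"))
```

A destroy that cleans up correctly would leave this list empty for the destroyed cluster's infra ID.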
https://issues.redhat.com/browse/CORS-3217 covers the upstream changes to CAPG needed to add disk encryption. In addition, changes will be needed in the installer to set the GCPMachine disk encryption based on the machinepool settings.
Notes on the required changes are at https://docs.google.com/document/d/1kVgqeCcPOrq4wI5YgcTZKuGJo628dchjqCrIrVDS83w/edit?usp=sharing
Once the upstream changes from CORS-3217 have been accepted:
I want to destroy the load balancers created by CAPG.
When using the CAPG provider the ServiceAccounts created by the installer for the master and worker nodes do not have the role bindings added correctly.
For example this query shows that the SA for the master nodes has no role bindings.
$ gcloud projects get-iam-policy openshift-dev-installer --flatten="bindings[].members" --format='table(bindings.role)' --filter='bindings.members:bfournie-capg-test-lk5t5-m@openshift-dev-installer.iam.gserviceaccount.com'
$
As an installer user, I want the GCP credentials used for the install to be used by the CAPG controller when provisioning resources.
CAPG has been updated to 1.6, see https://github.com/kubernetes-sigs/cluster-api-provider-gcp/releases/tag/v1.6.0
We need to pick this up to get the latest features including disk encryption.
I want the installer to create the service accounts that would be assigned to control plane and compute machines, similar to what is done in terraform now.
Machines for GCP need to be generated for use in CAPI. This will be similar to the AWS machine implementation
(https://github.com/openshift/installer/blob/master/pkg/asset/machines/aws/awsmachines.go) added in
https://github.com/openshift/installer/pull/7771
When testing GCP using the CAPG provider (not Terraform) in 4.16, it was found that the master VM instances were not distributed across instance groups but were all assigned to the same instance group.
Here is a (partial) CAPG install vs. an installation completed using Terraform. The CAPG installation (bfournie-capg-test-5ql8j) has VMs all using us-east1-b:
$ gcloud compute instances list | grep bfournie
bfournie-capg-test-5ql8j-bootstrap     us-east1-b  n2-standard-4  10.0.0.4    34.75.212.239  RUNNING
bfournie-capg-test-5ql8j-master-0      us-east1-b  n2-standard-4  10.0.0.5                   RUNNING
bfournie-capg-test-5ql8j-master-1      us-east1-b  n2-standard-4  10.0.0.6                   RUNNING
bfournie-capg-test-5ql8j-master-2      us-east1-b  n2-standard-4  10.0.0.7                   RUNNING
bfournie-test-tf-pdrsw-master-0        us-east4-a  n2-standard-4  10.0.0.4                   RUNNING
bfournie-test-tf-pdrsw-worker-a-vxjbk  us-east4-a  n2-standard-4  10.0.128.2                 RUNNING
bfournie-test-tf-pdrsw-master-1        us-east4-b  n2-standard-4  10.0.0.3                   RUNNING
bfournie-test-tf-pdrsw-worker-b-ksxfg  us-east4-b  n2-standard-4  10.0.128.3                 RUNNING
bfournie-test-tf-pdrsw-master-2        us-east4-c  n2-standard-4  10.0.0.5                   RUNNING
bfournie-test-tf-pdrsw-worker-c-jpzd5  us-east4-c  n2-standard-4  10.0.128.4                 RUNNING
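The expected placement, with one master per zone as in the Terraform install above, can be sketched as a simple round-robin assignment. This is only an illustration of the desired outcome, not how CAPG or the installer assigns zones. The function name `spread_across_zones` and the machine names are hypothetical.

```python
def spread_across_zones(machine_names, zones):
    """Assign machines to zones round-robin so they are spread evenly."""
    if not zones:
        raise ValueError("at least one zone is required")
    return {name: zones[i % len(zones)] for i, name in enumerate(machine_names)}


masters = ["master-0", "master-1", "master-2"]
zones = ["us-east4-a", "us-east4-b", "us-east4-c"]

placement = spread_across_zones(masters, zones)
for name, zone in placement.items():
    print(name, zone)
# master-0 us-east4-a
# master-1 us-east4-b
# master-2 us-east4-c
```

With fewer zones than machines, the assignment wraps around, so two zones would host two masters and one master respectively.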
I want to create a load balancer to provide split-horizon DNS for the cluster.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision AWS infrastructure without the use of Terraform.
Use cases to ensure:
Security group IDs are added to control plane nodes when installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs is specified.
When installconfig.platform.aws.userTags is specified, all taggable resources should have the specified user tags.
CAPA shows
I0312 18:00:13.602972 109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="9cda22be-5acd-4670-840f-8a6708437385" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-2"
I0312 18:00:13.608919 109 s3.go:220] "Deleting S3 object" controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="1ed0ad52-ffc1-4b62-97e4-876f8e8c3242" machine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" cluster="openshift-cluster-api-guests/rdossant-installer-03-jjf6b" bucket="openshift-bootstrap-data-rdossant-installer-03-jjf6b" key="control-plane/rdossant-installer-03-jjf6b-master-0"
[...]
E0312 18:04:25.282967 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g== >
E0312 18:04:25.284197 109 controller.go:329] "Reconciler error" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYY3QSWKBBDZ7R8, host id: 2f3HawFbPheaptP9E+WRbu3fhEXTMwyZQ1DBPGBG7qlg74ssQR0XISM4OSlxvrn59GeFREtN4hp9C+S5LgQD2g== > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="7fac94a1-772a-4c7b-a631-5ef7fc015d5b"
E0312 18:04:25.286152 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4= >
E0312 18:04:25.287353 109 controller.go:329] "Reconciler error" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYPFY0EQBM42VYH, host id: nJZakAhLrbZ1xrSNX3tyk0IKmMgFjsjMSs/D9nzci90GfRNNfUnvwZTbcaUBQYiuSlY5+aysCuwejWpvi8FmGusbQCK1Qtjr9pjqDQfxzY4= > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-2" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-2" reconcileID="b6c792ad-5519-48d5-a994-18dda76d8a93"
E0312 18:04:25.291383 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g== >
E0312 18:04:25.292132 109 controller.go:329] "Reconciler error" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYGWSJDR35Q4GWX, host id: Qnltg++ia3VapXjtENZOQIwfAxbxfwVLPlC0DwcRBx+L60h52ENiNqMOkvuNwJyYnPxbo/CaawzMT11oIKGO9g== > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-1" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-1" reconcileID="92e1f8ed-b31f-4f75-9083-59aad15efe79"
E0312 18:04:25.679859 109 awsmachine_controller.go:576] "controllers/AWSMachine: unable to delete secrets" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg= >
E0312 18:04:25.680663 109 controller.go:329] "Reconciler error" err=< deleting bootstrap data object: deleting S3 object: NotFound: Not Found status code: 404, request id: 9QYSBZGYPC7SNJEX, host id: EplmtNQ+RxmbU88z+4App6YEVvniJpyCeMiMZuUegJIMqZgbkA1lmCjHntSLDm4eA857OdhtHsn+zD6AX7uelGIsogzN2ZziiAZXZrbIIEg= > controller="awsmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSMachine" AWSMachine="openshift-cluster-api-guests/rdossant-installer-03-jjf6b-master-0" namespace="openshift-cluster-api-guests" name="rdossant-installer-03-jjf6b-master-0" reconcileID="9e436c67-aca0-409c-9179-0ce4cccce9ad"
CAPA attempts these S3 deletions even though we are not creating S3 buckets for the master nodes. The resulting errors prevent the bootstrap process from finishing.
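One common way to keep a missing object from blocking cleanup is to treat "not found" as "already deleted". The sketch below shows that idea with an in-memory stand-in for a bucket. It is an assumption about a possible approach, not the CAPA fix and not AWS SDK code; `FakeBucket`, `NotFoundError`, and `delete_bootstrap_object` are hypothetical names.

```python
class NotFoundError(Exception):
    """Raised by the stand-in bucket when the key does not exist."""


class FakeBucket:
    """In-memory stand-in for an object store bucket (not an AWS client)."""

    def __init__(self, keys=()):
        self.keys = set(keys)

    def delete(self, key):
        if key not in self.keys:
            raise NotFoundError(key)
        self.keys.remove(key)


def delete_bootstrap_object(bucket, key):
    """Delete key if present; a missing key counts as success.

    Returns True if an object was deleted, False if it was already absent.
    """
    try:
        bucket.delete(key)
        return True
    except NotFoundError:
        # Nothing to delete, so cleanup can proceed instead of erroring.
        return False


bucket = FakeBucket({"control-plane/bootstrap"})
print(delete_bootstrap_object(bucket, "control-plane/bootstrap"))  # True
print(delete_bootstrap_object(bucket, "control-plane/master-0"))   # False
```

With this behavior, a reconcile loop would not retry forever on objects that were never created.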
The current implementation assumes that subnets have auto-assign public IPs turned on, which is how CAPA configures the subnets it creates. When you supply your own VPC where that is not the case, the bootstrap node does not get a public IP and therefore cannot download the release image (no internet connection).
The bootstrap node needs a public IP because the public subnets are connected only to the internet gateway, which does not provide NAT.
Destroy all bootstrap resources created through the new non-terraform provider.
The IAM role is correctly attached to control plane nodes when installconfig.controlPlane.platform.aws.iamRole is specified.
CAPA creates 4 security groups:
$ aws ec2 describe-security-groups --region us-east-2 --filters "Name = group-name, Values = *rdossant*" --query "SecurityGroups[*].[GroupName]" --output text
rdossant-installer-03-tvcbd-lb
rdossant-installer-03-tvcbd-controlplane
rdossant-installer-03-tvcbd-apiserver-lb
rdossant-installer-03-tvcbd-node
Given that the maximum number of SGs in a network interface is 16, we should update the max number validation in the installer:
https://github.com/openshift/installer/blob/master/pkg/types/aws/validation/machinepool.go#L66
Patrick says:
I think we want to update this to cap the user limit to 10 additional security groups:
More context: https://redhat-internal.slack.com/archives/C68TNFWA2/p1697764210634529?thread_ts=1697471429.293929&cid=C68TNFWA2
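A minimal sketch of the proposed cap follows, using the numbers quoted above: 16 security groups per network interface, 4 created by CAPA, and a user limit of 10 additional groups. The installer's real validation is written in Go in the file linked above. The function and constant names here are hypothetical and only illustrate the arithmetic.

```python
MAX_SGS_PER_INTERFACE = 16  # limit quoted in the text
CAPA_CREATED_SGS = 4        # groups CAPA creates, per the listing above
MAX_ADDITIONAL_SGS = 10     # proposed cap on user-supplied groups

# The cap must leave room for the groups CAPA creates.
assert CAPA_CREATED_SGS + MAX_ADDITIONAL_SGS <= MAX_SGS_PER_INTERFACE


def validate_additional_security_groups(sg_ids):
    """Return a list of validation error strings (empty if valid)."""
    errors = []
    if len(sg_ids) > MAX_ADDITIONAL_SGS:
        errors.append(
            f"too many additional security groups: {len(sg_ids)} "
            f"(maximum is {MAX_ADDITIONAL_SGS})"
        )
    return errors


ok = [f"sg-{i:08d}" for i in range(10)]        # hypothetical IDs
too_many = [f"sg-{i:08d}" for i in range(11)]  # hypothetical IDs
print(validate_additional_security_groups(ok))        # []
print(validate_additional_security_groups(too_many))
```

Capping at 10 rather than 12 leaves two slots of headroom under the per-interface limit.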
When installconfig.controlPlane.platform.aws.metadataService is set, the metadata service is correctly configured for control plane machines.
The schema check[1] in the LB reconciliation is hardcoded to check the primary load balancer only. As a result, subnets are always filtered using the primary load balancer's schema, and additional load balancers ("SecondaryControlPlaneLoadBalancer") are ignored.
Private hosted zone and cross-account shared VPC work when installconfig.platform.aws.hostedZone is specified.
As a multiarch CI-focused engineer, I want to create a workflow in `openshift/release` that will enable creating the backend nodes for a cluster installation.
Epic Goal
Through this epic, we will update our CI to have an agent-based workflow available instead of the libvirt openshift-installer, allowing us to eliminate the use of Terraform in our deployments.
Why is this important?
There is an active initiative in openshift to remove terraform from the openshift installer.
As a CI job author, I would like to be able to reference a yaml/json parsing tool that works across architectures and doesn't need to be downloaded for each unique step.
Rafael pointed out that Alessandro added multi-arch containers for yq for the UPI installer:
https://github.com/openshift/release/pull/49036#discussion_r1499870554
yq should have the ability to parse json.
We should evaluate if this can be added to the libvirt-installer image as well, and then used by all of our libvirt CI steps.
The customer has escalated the following issues where ports don't have TLS support. This feature request lists all the component ports raised by the customer.
Details here https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Currently, we serve the metrics over HTTP on port 9537. We need to upgrade this to use TLS.
Related to https://docs.google.com/document/d/1zB9vUGB83xlQnoM-ToLUEBtEGszQrC7u-hmhCnrhuXM/edit
Enhance Dynamic plugin with similar capabilities as Static page. Add new control and security related enhancements to Static page.
The Dynamic plugin should list pipelines similar to the current static page.
The Static page should allow users to override task and sidecar task parameters.
The Static page should allow users to control tasks that are set up for manual approval.
The TSSC security and compliance policies should be visible in Dynamic plugin.
With the OpenShift Pipelines operator 1.2x, we added support for a dynamic console plugin to the operator. In the first version, it is only responsible for the Dashboard and the Pipeline/Repository Metrics tab. We want to move more and more code to the dynamic plugin and later remove it from the console repository.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision OpenShift on the existing supported providers' infrastructure without the use of Terraform.
This feature will be used to track all the CAPI preparation work that is common for all the supported providers
Write CAPI manifests to disk during create manifests so that they can be user edited and users can also provide their own set of manifests. In general, we think of manifests as an escape hatch that should be used when a feature is missing from the install config, and users accept the degraded user experience of editing manifests in order to achieve non-install-config-supported functionality.
Acceptance criteria:
Manifests should be generated correctly (and applied correctly to the control plane):
There is some WIP for this, but there are issues with the serialization/deserialization flow when writing the GVK in the manifests.
Fit provisioning via the CAPI system into the infrastructure.Provider interface
so that:
PoC & design for running CAPI control plane using binaries.
As a CAPI install user, I want to be able to:
so that I can achieve
Description of criteria:
This is intended to be platform agnostic. If there is a common way for obtaining ip addresses from capi manifests, this should be sufficient. Otherwise, this should enable other platforms to implement their specific logic.
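A platform-agnostic lookup could read the address list that Cluster API records in a Machine's `status.addresses` field, where each entry has a `type` and an `address`. The sketch below assumes the manifest has already been loaded into a Python dict. The function name `machine_addresses` and the sample manifest values are hypothetical, and the installer's own implementation (written in Go) may differ.

```python
def machine_addresses(manifest, address_types=("InternalIP", "ExternalIP")):
    """Return addresses of the requested types from a Machine manifest dict."""
    addresses = (manifest.get("status") or {}).get("addresses") or []
    return [
        entry["address"]
        for entry in addresses
        if entry.get("type") in address_types and entry.get("address")
    ]


# Hypothetical Machine manifest, already parsed from YAML or JSON.
machine = {
    "apiVersion": "cluster.x-k8s.io/v1beta1",
    "kind": "Machine",
    "metadata": {"name": "example-bootstrap"},
    "status": {
        "addresses": [
            {"type": "InternalIP", "address": "10.0.0.4"},
            {"type": "ExternalIP", "address": "203.0.113.10"},
            {"type": "InternalDNS", "address": "example-bootstrap.internal"},
        ]
    },
}

print(machine_addresses(machine))                   # ['10.0.0.4', '203.0.113.10']
print(machine_addresses(machine, ("ExternalIP",)))  # ['203.0.113.10']
```

Platforms whose providers do not populate this field would still need their own lookup logic, as the text above notes.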
Allow setting custom tags to machines created during the installation of an OpenShift cluster on vSphere.
Just as labeling is important in Kubernetes for organizing API objects and compute workloads (pods/containers), the same is true for the Kube/OCP node VMs running on the underlying infrastructure in any hosted or cloud platform.
Reporting, auditing, troubleshooting and internal organization processes all require ways of easily filtering on and referencing servers by naming, labels or tags. Adding appropriate tags to all OCP nodes in VMware lets those doing troubleshooting, reporting or auditing easily identify and filter OpenShift node VMs.
For example we can use tags for prod vs. non-prod, VMs that should have backup snapshots vs. those that shouldn't, VMs that fall under certain regulatory constraints, etc.
We have decided to remove IPI and UPI support for Alibaba Cloud, which until recently has been in Tech Preview, for the following reasons:
(1) Low customer interest in using OpenShift on Alibaba Cloud
(2) Removal of Terraform usage
(3) MAPI to CAPI migration
(4) CAPI adoption for installation (Day 1) in OpenShift 4.16 and Day 2 (post-OpenShift 4.16)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Impacted areas based on CI:
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
---|---|
Self-managed, managed, or both | Self-managed |
Classic (standalone cluster) | Classic (standalone) |
Hosted control planes | N/A |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | All |
Operator compatibility | N/A |
Backport needed (list applicable versions) | N/A |
Other (please specify) | N/A |
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
Impacted areas based on CI:
alibaba-cloud-csi-driver/openshift-alibaba-cloud-csi-driver-release-4.16.yaml
alibaba-disk-csi-driver-operator/openshift-alibaba-disk-csi-driver-operator-release-4.16.yaml
cloud-provider-alibaba-cloud/openshift-cloud-provider-alibaba-cloud-release-4.16.yaml
cluster-api-provider-alibaba/openshift-cluster-api-provider-alibaba-release-4.16.yaml
cluster-cloud-controller-manager-operator/openshift-cluster-cloud-controller-manager-operator-release-4.16.yaml
machine-config-operator/openshift-machine-config-operator-release-4.16.yaml
Enable the OCP Console to send back user analytics to our existing endpoints in console.redhat.com. Please refer to doc for details of what we want to capture in the future:
Collect desired telemetry of user actions within OpenShift console to improve knowledge of user behavior.
OpenShift console should be able to send telemetry to a pre-configured Red Hat proxy that can be forwarded to 3rd party services for analysis.
User analytics should respect the existing telemetry mechanism used to disable data being sent back
Need to update existing documentation with what user data we track from the OCP Console: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/about-remote-health-monitoring.html
Capture and send desired user analytics from OpenShift console to Red Hat proxy
Red Hat proxy to forward telemetry events to appropriate Segment workspace and Amplitude destination
Use existing setting to opt out of sending telemetry: https://docs.openshift.com/container-platform/4.14/support/remote_health_monitoring/opting-out-of-remote-health-reporting.html#opting-out-remote-health-reporting
Also, allow disabling just user analytics without affecting the rest of telemetry: add an annotation to the Console to disable just user analytics.
Update docs to show this method as well.
We will require a mechanism to store all the segment values
We need to be able to pass back orgID that we receive from the OCM subscription API call
Sending telemetry from OpenShift cluster nodes
Console already has support for sending analytics to segment.io in Dev Sandbox and OSD environments. We should reuse this existing capability, but default to http://console.redhat.com/connections/api for analytics and http://console.redhat.com/connections/cdn to load the JavaScript in other environments. We must continue to allow Dev Sandbox and OSD clusters a way to configure their own segment key, whether telemetry is enabled, segment API host, and other options currently set as annotations on the console operator configuration resource.
Console will need a way to determine the org-id to send with telemetry events. Likely the console operator will need to read this from the cluster pull secret.
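The defaulting and opt-out behavior described above can be sketched as follows. This is a minimal illustration, not the console operator's actual logic: the annotation prefix and key names (`example.telemetry/...`) and the function `resolve_telemetry` are hypothetical, and only the two default URLs come from the text above.

```python
# Defaults quoted from the description above.
DEFAULT_API_HOST = "http://console.redhat.com/connections/api"
DEFAULT_JS_HOST = "http://console.redhat.com/connections/cdn"

# Hypothetical annotation keys, used only for illustration.
PREFIX = "example.telemetry/"
KEY_DISABLED = PREFIX + "DISABLED"
KEY_API_HOST = PREFIX + "SEGMENT_API_HOST"
KEY_JS_HOST = PREFIX + "SEGMENT_JS_HOST"
KEY_SEGMENT_KEY = PREFIX + "SEGMENT_API_KEY"


def resolve_telemetry(annotations, cluster_telemetry_enabled):
    """Merge per-cluster annotation overrides with the default endpoints.

    User analytics is sent only when cluster-wide telemetry is enabled AND
    the (hypothetical) analytics-only opt-out annotation is not set.
    """
    analytics_disabled = annotations.get(KEY_DISABLED, "").lower() == "true"
    return {
        "enabled": cluster_telemetry_enabled and not analytics_disabled,
        # Dev Sandbox / OSD style overrides win over the defaults.
        "api_host": annotations.get(KEY_API_HOST, DEFAULT_API_HOST),
        "js_host": annotations.get(KEY_JS_HOST, DEFAULT_JS_HOST),
        "segment_key": annotations.get(KEY_SEGMENT_KEY),
    }
```

With no annotations and cluster telemetry enabled, the sketch returns the two default hosts; setting the opt-out annotation to `"true"` disables only user analytics.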
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
Goal:
Update console telemetry plugin to send data to the appropriate ingress point.
Ingress point created for console.redhat.com
This is a clone of issue OCPBUGS-25722. The following is the description of the original issue:
—
The console telemetry plugin needs to send data to a new Red Hat ingress point that will then forward it to Segment for analysis.
For that the telemetry-console-plugin must have options to configure where it loads the analytics.js and where to send the API calls (analytics events).
HyperShift-provisioned clusters, regardless of the cloud provider, support the proposed integration for OLM-managed integration outlined in OCPBU-559 and OCPBU-560.
There is no degradation in capability or coverage for OLM-managed operators that support short-lived token authentication on clusters that are lifecycled via HyperShift.
Currently, Hypershift lacks support for CCO.
Currently, Hypershift will be limited to deploying clusters in which the cluster core operators are leveraging short-lived token authentication exclusively.
If we are successful, no special documentation should be needed for this.
Outcome Overview
Operators on guest clusters can take advantage of the new tokenized authentication workflow that depends on CCO.
Success Criteria
CCO is included in HyperShift and its footprint is minimal while meeting the above outcome.
Expected Results (what, how, when)
Post Completion Review – Actual Results
After completing the work (as determined by the "when" in Expected Results above), list the actual results observed / measured during Post Completion review(s).
This is a clone of issue CCO-388. The following is the description of the original issue:
—
Every guest cluster should have a running CCO pod with its kubeconfig attached to it.
Enhancement doc: https://github.com/openshift/enhancements/blob/master/enhancements/cloud-integration/tokenized-auth-enablement-operators-on-cloud.md
CCO logic for managing webhooks is a) entirely separate from the core functionality of the CCO and b) requires a lot of extra RBAC. In deployment topologies like HyperShift, we don't want this additional functionality and would like to be able to cleanly turn it off and remove the excess RBAC.
The default OpenShift installation on AWS uses multiple public IPv4 addresses, which Amazon will begin charging for in February 2024. As a result, there is a requirement to find an alternative path for OpenShift to reduce the overall cost of a public cluster deployed on AWS public cloud.
Provide an alternative path to reduce the new costs associated with public IPv4 addresses when deploying OpenShift on AWS public cloud.
There is a new path for "external" OpenShift deployments on AWS public cloud where the new costs associated with public IPv4 addresses have a minimum impact on the total cost of the required infrastructure on AWS.
Ongoing discussions on this topic are happening in Slack in the #wg-aws-ipv4-cost-mitigation private channel
Usual documentation will be required in case there are any new user-facing options available as a result of this feature.
*Resources that consume public IPv4 addresses: bootstrap, API public NLB, NAT gateways
Enable installation and lifecycle support of OpenShift 4 on Oracle Cloud Infrastructure (OCI) Bare metal
Use scenarios
Why is this important
Requirement | Notes |
---|---|
OCI Bare Metal Shapes must be certified with RHEL | It must also work with RHCOS (see iSCSI boot notes) as OCI BM standard shapes require RHCOS iSCSI to boot (OCPSTRAT-1246) Certified shapes: https://catalog.redhat.com/cloud/detail/249287 |
Successfully passing the OpenShift Provider conformance testing – this should be fairly similar to the results from the OCI VM test results. | Oracle will do these tests. |
Updating Oracle Terraform files | |
Making the Assisted Installer modifications needed to address the CCM changes and surface the necessary configurations. | Support Oracle Cloud in Assisted-Installer CI: |
RFEs:
Any bare metal Shape to be supported with OCP has to be certified with RHEL.
From the certified Shapes, those that have local disks will be supported. This is due to the current lack of support in RHCOS for the iSCSI boot feature. OCPSTRAT-749 is tracking adding this support and removing this restriction in the future.
As of Aug 2023 this excludes at least all the Standard shapes, BM.GPU2.2 and BM.GPU3.8, from the published list at: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#baremetalshapes
During 4.15, the OCP team is working on allowing booting from iSCSI. Today that is disabled by the Assisted Installer. The goal is to enable it for OCP versions >= 4.15 when using the OCI external platform.
iSCSI boot is enabled for OCP versions >= 4.15 in both the UI and the backend.
When booting from iscsi, we need to make sure to add the `rd.iscsi.firmware=1 ip=ibft` kargs during install to enable iSCSI booting.
yes
Assisted Service should allow booting from iSCSI for x86_64 OpenShift versions at least 4.15.0.
Multipath is not supported at this time.
When the Assisted agent boots, it should connect to iBFT iSCSI targets
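The version and architecture gate described above, together with the extra kernel arguments, can be sketched as follows. This is a hedged illustration, not Assisted Service code: the function names are hypothetical, and treating a pre-release such as `4.15.0-rc.1` the same as `4.15.0` is an assumption. Only the kargs, the `x86_64` restriction and the 4.15.0 minimum come from the text above.

```python
# Kernel arguments quoted from the description above.
ISCSI_KARGS = ["rd.iscsi.firmware=1", "ip=ibft"]
MIN_ISCSI_VERSION = (4, 15, 0)


def parse_version(version):
    """Turn "4.15.0" into (4, 15, 0).

    Assumption: any pre-release suffix after "-" is ignored.
    """
    core = version.split("-", 1)[0]
    parts = [int(p) for p in core.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def iscsi_boot_allowed(version, arch):
    """iSCSI boot is allowed only for x86_64 and OpenShift >= 4.15.0."""
    return arch == "x86_64" and parse_version(version) >= MIN_ISCSI_VERSION


def install_kargs(version, arch, boot_disk_is_iscsi):
    """Return the extra kargs needed when installing onto an iSCSI boot disk."""
    if not boot_disk_is_iscsi:
        return []
    if not iscsi_boot_allowed(version, arch):
        raise ValueError(
            f"iSCSI boot is not supported for {arch} OpenShift {version}"
        )
    return list(ISCSI_KARGS)
```

For example, `install_kargs("4.15.0", "x86_64", True)` returns both kargs, while the same call with `"4.14.9"` raises `ValueError`.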
https://github.com/openshift/installer/pull/7457 introduces a change of behavior: the Cloud Controller Manager will be disabled by default.
We need to explicitly enable it when deploying on the OCI platform.
The MCO should properly report its state in a way that's consistent and able to be understood by customers, troubleshooters, and maintainers alike.
Some customer cases have revealed scenarios where the MCO state reporting is misleading and therefore could be unreliable to base decisions and automation on.
In addition to correcting some incorrect states, the MCO will be enhanced for a more granular view of update rollouts across machines.
Similar to bug 1955300, but seen in a recent 4.11-to-4.11 update [1]:
: [bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
Run #0: Failed 47m16s
1 unexpected clusteroperator state transitions during e2e test run
Feb 05 22:15:40.430 - 1044s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
{operator 4.11.0-0.nightly-2022-02-05-152519}]]
Feb 05 17:21:15.357 - 1087s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 09:31:14.667 - 1632s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 12:29:22.119 - 1060s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.8.0-0.okd-2022-02-05-101655
Feb 05 17:43:45.938 - 1380s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for 4.6.54
Feb 06 02:35:34.300 - 1085s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 06 06:15:23.991 - 1135s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Feb 05 09:25:22.083 - 1071s E clusteroperator/machine-config condition/Available status/False reason/Cluster not available for [
]
Breaking down by job name:
$ w3m -dump -cols 200 'https://search.ci.openshift.org?maxAge=24h&type=junit&context=0&search=s+E+clusteroperator/machine-config+condition/Available+status/False' | grep 'failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.8-upgrade-from-nightly-4.7-ocp-remote-libvirt-s390x (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-aws-ovn-upgrade (all) - 70 runs, 47% failed, 6% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-azure-ovn-upgrade (all) - 40 runs, 60% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-e2e-gcp-upgrade (all) - 76 runs, 42% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade (all) - 77 runs, 65% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-ovn-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-aws-upgrade-rollback (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-gcp-ovn-upgrade (all) - 41 runs, 61% failed, 12% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-ovirt-upgrade (all) - 4 runs, 75% failed, 33% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-aws-ovn-upgrade (all) - 80 runs, 59% failed, 4% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.11-e2e-gcp-upgrade (all) - 82 runs, 51% failed, 7% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-aws-ovn-upgrade (all) - 88 runs, 55% failed, 8% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.11-upgrade-from-stable-4.10-e2e-azure-upgrade (all) - 79 runs, 54% failed, 2% of failures match = 1% impact
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-from-stable-4.7-from-stable-4.6-e2e-aws-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade (all) - 45 runs, 44% failed, 25% of failures match = 11% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade (all) - 33 runs, 45% failed, 13% of failures match = 6% impact
periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade (all) - 3 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-release-master-okd-4.10-e2e-vsphere (all) - 6 runs, 100% failed, 17% of failures match = 17% impact
pull-ci-openshift-cluster-authentication-operator-master-e2e-agnostic-upgrade (all) - 8 runs, 100% failed, 13% of failures match = 13% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
pull-ci-openshift-machine-config-operator-master-e2e-agnostic-upgrade (all) - 31 runs, 100% failed, 3% of failures match = 3% impact
release-openshift-okd-installer-e2e-aws-upgrade (all) - 8 runs, 75% failed, 17% of failures match = 13% impact
release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
Those impact percentages are just matches; this particular test-case is non-fatal.
The Available=False conditions also lack a 'reason', although they do contain a 'message', which is the same state we had back when I'd filed bug 1948088. Maybe we can pass through the Degraded reason around [4]?
Going back to the run in [1], the Degraded condition had a few minutes at RenderConfigFailed, while [4] only has a carve out for RequiredPools. And then the Degraded condition went back to False, but for reasons I don't understand we remained Available=False until 22:33, when the MCO declared its portion of the update complete:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088/artifacts/e2e-aws-upgrade/openshift-e2e-test/artifacts/e2e.log | grep 'clusteroperator/machine-config '
Feb 05 22:15:40.029 E clusteroperator/machine-config condition/Degraded status/True reason/RenderConfigFailed changed: Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.029 - 147s E clusteroperator/machine-config condition/Degraded status/True reason/Failed to resync 4.11.0-0.nightly-2022-02-05-152519 because: refusing to read images.json version "4.11.0-0.nightly-2022-02-05-211325", operator version "4.11.0-0.nightly-2022-02-05-152519"
Feb 05 22:15:40.430 E clusteroperator/machine-config condition/Available status/False changed: Cluster not available for [
]
Feb 05 22:18:07.150 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.150 - 898s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:18:07.178 W clusteroperator/machine-config condition/Degraded status/False changed:
Feb 05 22:18:21.505 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Feb 05 22:33:04.574 W clusteroperator/machine-config condition/Available status/True changed: Cluster has deployed [
]
Feb 05 22:33:04.584 W clusteroperator/machine-config condition/Upgradeable status/True changed:
Feb 05 22:33:04.931 I clusteroperator/machine-config versions: operator 4.11.0-0.nightly-2022-02-05-152519 -> 4.11.0-0.nightly-2022-02-05-211325
Feb 05 22:33:05.531 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.11.0-0.nightly-2022-02-05-211325
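As a cross-check of the timeline above, the length of the Available=False window can be recomputed from the two `changed:` lines. The sketch below is only an illustration: the helper names are hypothetical, and the year 2022 is an assumption taken from the nightly build names, since the log timestamps carry no year.

```python
from datetime import datetime

# The two Available transitions copied from the e2e.log excerpt above.
LOG_LINES = [
    "Feb 05 22:15:40.430 E clusteroperator/machine-config "
    "condition/Available status/False changed: Cluster not available for [",
    "Feb 05 22:33:04.574 W clusteroperator/machine-config "
    "condition/Available status/True changed: Cluster has deployed [",
]

ASSUMED_YEAR = "2022"  # assumption: inferred from the 2022-02-05 nightly names


def parse_timestamp(line):
    """Parse the leading "Mon DD HH:MM:SS.mmm" fields of a log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{ASSUMED_YEAR} {stamp}", "%Y %b %d %H:%M:%S.%f")


def unavailable_seconds(lines):
    """Sum the time spent between Available=False and the next Available=True."""
    went_false = None
    total = 0.0
    for line in lines:
        if "condition/Available" not in line:
            continue
        if "status/False" in line and went_false is None:
            went_false = parse_timestamp(line)
        elif "status/True" in line and went_false is not None:
            total += (parse_timestamp(line) - went_false).total_seconds()
            went_false = None
    return total
```

Calling `unavailable_seconds(LOG_LINES)` gives roughly 1044 seconds, which agrees with the `1044s` interval reported for this run in the summary earlier in this report.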
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Available
[bz-Machine Config Operator] clusteroperator/machine-config should not change condition/Degraded
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-aws-upgrade/1490071797725401088
[2]: https://github.com/openshift/cluster-version-operator/blob/06ec265e3a3bf47b599e56aec038022edbe8b5bb/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L79-L87
[3]: https://github.com/openshift/cluster-version-operator/pull/643
[4]: https://github.com/openshift/machine-config-operator/blob/2add8f323f396a2063257fc283f8eed9038ea0cd/pkg/operator/status.go#L122-L126
Add OwnerReferences to the MCN ObjectMeta so that it gets garbage collected.
Description of problem:
The MCO takes too long to update the node count for an MCP after the label that the MCP uses to match nodes is removed from a node.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Remove `node-role.kubernetes.io/worker=` label from any worker node.
~~~
# oc label node worker-0.sharedocp4upi411ovn.lab.upshift.rdu2.redhat.com node-role.kubernetes.io/worker-
~~~
2. Check MCP worker for correct node count.
~~~
# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      3              3                   3                     0                      5d7h
~~~
3. Check after 10-15 mins
~~~
# oc get mcp worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6916abae250ad092875791f8297c13e1   True      False      False      2              2                   2                     0                      5d7h
~~~
Actual results:
It took 10-15 mins for MCP to detect node removal.
Expected results:
It should detect node removal as soon as the matching label is removed from the node.
Additional info:
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone OpenShift
Prerequisite work: goals completed in OCPSTRAT-1122
Complete the design of the Cluster API (CAPI) architecture and build the core operator logic needed for Phase-1, incorporating the assets from different repositories to simplify asset management.
Phases 1 & 2 cover implementing base functionality for CAPI.
There must be no negative effect on customers/users of the MAPI. This API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open
sets up CAPI ecosystem for vSphere
So far we haven't tested this provider at all. We have to run it and spot if there are any issues with it.
Steps:
Outcome:
Create an Installer RHEL9-based build for FIPS-enabled OpenShift installations
As a user, I want to enable FIPS while deploying OpenShift on any platform that supports this standard, so the resultant cluster is compliant with FIPS security standards
Provide a dynamically linked build of the Installer for RHEL 9 in the release payload
Anyone reviewing this Feature needs to know which deployment configurations that the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For specific configurations that are out-of-scope for a given release, ensure you provide the OCPSTRAT (for the future to be supported configuration) as well.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
---|---|
Self-managed, managed, or both | both |
Classic (standalone cluster) | yes |
Hosted control planes | n/a |
Multi node, Compact (three node), or Single node (SNO), or all | all |
Connected / Restricted Network | all |
Architectures, e.g. x86_64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | all |
Operator compatibility | |
Backport needed (list applicable versions) | no |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | OCM |
Other (please specify) | |
Docs will need to guide users on which Installer binary to use for FIPS-enabled clusters
As a user with no FIPS requirement, I want to be able to use the same openshift-install binary on both RHEL 8 and RHEL 9, as well as other common Linux distributions.
libvirt is not a supported platform for openshift-install. Nonetheless, it appears in the platforms list (likely inadvertently) when running the openshift-baremetal-install binary because the code for it was enabled in order to link against libvirt.
Now that linking against libvirt is no longer required, there is no reason to continue shipping this unsupported code.
We will need to come up with a separate build tag to distinguish between the openshift-baremetal-install (dynamic) and openshift-install (static) builds. Currently these are distinguished by the libvirt tag.
As a user, I want to know how to download and use the correct installer binary to install a cluster with FIPS mode enabled. If I use the wrong binary or don't have FIPS enabled, I need instructions at the point I am trying to create a FIPS-mode cluster.
Currently, to use baremetal IPI, a user must retrieve the openshift-baremetal-install binary from the release payload. Historically, this was because it needed to dynamically link to libvirt. This is no longer the case, so we can make baremetal IPI available in the standard openshift-install binary.
Continue scale testing and performance improvements for ovn-kubernetes
Networking Definition of Planned
Epic Template descriptions and documentation
Manage OpenShift Virtual Machines IP addresses from within the SDN solution provided by OVN-Kubernetes.
Customers want to offload IPAM from their custom solutions (e.g. custom DHCP server running on their cluster network) to SDN.
Additional information on each of the above items can be found here: Networking Definition of Planned
Theme: Ensure 4.12 SD is as stable as 4.13 SD. Identify what is present in 4.14/4.13 but missing in 4.12 from the OVNK point of view.
We need to come up with a KCS article for 4.12/4.13 around network policy issues. Some things it should cover are:
Check the existing network policies used by SD MCs and review them to see whether they are efficient
Talk about how the new OVN 23.06 will fix the except block issue and, if we need to backport those port range fixes, cover that too
Goal: End result should be a document and backports if needed outside of the OVN bump planned as part of https://issues.redhat.com/browse/OCPBUGS-22091
Using a source port group instead of an address set will decrease the number of OVS flows per node.
Needs to be backported to 4.14
Goal:
Support enablement of dual-stack VIPs on existing clusters created as dual-stack but at a time when it was not possible to have both v4 and v6 VIPs at the same time.
Why is this important?
This is a followup to SDN-2213 ("Support dual ipv4 and ipv6 ingress and api VIPs").
We expect that customers with existing dual stack clusters will want to make use of the new dual stack VIPs fixes/enablement, but it's unclear how this will work because we've never supported modifying on-prem networking configuration after initial deployment. Once we have dual stack VIPs enabled, we will need to investigate how to alter the configuration to add VIPs to an existing cluster.
We will need to make changes to the VIP fields in the Infrastructure and/or ControllerConfig objects. Infrastructure would be the first option since that would make all of the fields consistent, but that relies on the ability to change that object and have the changes persist and be propagated to the ControllerConfig. If that's not possible, we may need to make changes just in ControllerConfig.
For epics https://issues.redhat.com/browse/OPNET-14 and https://issues.redhat.com/browse/OPNET-80 we need a mechanism to change configuration values related to our static pods. Today that is not possible because all of the values are put in the status field of the Infrastructure object.
We had previously discussed this as part of https://issues.redhat.com/browse/OPNET-21 because there was speculation that people would want to move from internal LB to external, which would require mutating a value in Infrastructure. In fact, there was a proposal to put that value in the spec directly and skip the status field entirely, but that was discarded because a migration would be needed in that case and we need separate fields to indicate what was requested and what the current state actually is.
There was some followup discussion about that with Joel Speed from the API team (which unfortunately I have not been able to find a record of yet) where it was concluded that if/when we want to modify Infrastructure values we would add them to the Infrastructure spec and when a value was changed it would trigger a reconfiguration of the affected services, after which the status would be updated.
This means we will need new logic in MCO to look at the spec field (currently there are only fields in the status, so spec is ignored completely) and determine the correct behavior when they do not match. This will mean the values in ControllerConfig will not always match those in Infrastructure.Status. That's about as far as the design has gone so far, but we should keep the three use cases we know of (internal/external LB, VIP addition, and DNS record overrides) in mind as we design the underlying functionality to allow mutation of Infrastructure status values.
Depending on how the design works out, we may only track the design phase in this epic and do the implementation as part of one of the other epics. If there is common logic that is needed by all and can be implemented independently we could do that under this epic though.
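The spec-versus-status behavior discussed above can be sketched as a small reconcile step. This is only a rough illustration under stated assumptions: the `apiVIPs` field name, the dict-shaped Infrastructure object and the `reconfigure` callback are all hypothetical and do not reflect the real Infrastructure API or MCO code.

```python
def reconcile_vips(infra, reconfigure):
    """Apply a requested VIP change and then record it in status.

    infra: dict with optional "spec" and "status" sections (hypothetical shape).
    reconfigure: callback that reconfigures the affected services.
    Returns True when a change was applied, False when nothing had to change.
    """
    requested = infra.get("spec", {}).get("apiVIPs")
    current = infra.get("status", {}).get("apiVIPs", [])

    # An empty spec means nothing was requested; status remains the source
    # of truth, which matches today's status-only behavior.
    if not requested or requested == current:
        return False

    # Spec and status differ: reconfigure first, update status afterwards,
    # so status always describes what is actually deployed.
    reconfigure(requested)
    infra.setdefault("status", {})["apiVIPs"] = list(requested)
    return True
```

The ordering is the design point: status is written only after the reconfiguration callback returns, so a failed reconfiguration leaves the mismatch visible.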
Tasks to do here
The Agent Based installer is a clean and simple way to install new instances of OpenShift in disconnected environments, guiding the user through the questions and information needed to successfully install an OpenShift cluster. We need to bring this highly useful feature to the IBM Power and IBM zSystems architectures.
Agent based installer on Power and zSystems should reflect what is available for x86 today.
Able to use the agent based installer to create OpenShift clusters on Power and zSystem architectures in disconnected environments
Enable openshift-install to create agent based install ISO for power.
As the multi-arch engineer, I would like to build an environment and deploy using Agent Based installer, so that I can confirm if the feature works per spec.
Acceptance Criteria
Feature Overview
This is a TechDebt and doesn't impact OpenShift Users.
As the autoscaler has become a key feature of OpenShift, there is a requirement to continue to expand its use, bringing all the features to all the cloud platforms and contributing to the community upstream. This feature is to track the initiatives associated with the Autoscaler in OpenShift.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
vSphere autoscaling from zero | No | |
Upstream E2E testing | No | |
Upstream adapt scale from zero replicas | No | |
Out of Scope
n/a
Background, and strategic fit
Autoscaling is a key benefit of the Machine API and should be made available on all providers
Assumptions
Customer Considerations
Documentation Considerations
Please note: the changes described by this epic will happen in OpenShift controllers, and as such there is no "upstream" relationship in the same sense as for the Kubernetes-based controllers.
As a user I want to ensure that scale from zero cluster autoscaling works well when using the upstream scaling hint annotations so that I can follow the community best practices. Having the cluster autoscaler operator monitor the scale from zero annotations, and correct them when incorrect, will confirm the correct behavior.
As part of migrating the OpenShift scale from zero annotations to use the upstream annotations keys, the cluster autoscaler operator should be updated to look for these annotations on MachineSets that it is monitoring.
Currently, we use annotations with the prefix "machine.openshift.io"; upstream, the prefix is "capacity.cluster-autoscaler.kubernetes.io". The CAO should be updated to recognize when a MachineSet has either set of annotations, and then ensure that both sets exist.
Adding both sets of annotations will help us during the transition to using the upstream set, and will also ensure backward compatibility with our published API.
Please note that care must be taken with the suffixes as well. Some of the OpenShift suffixes are different from upstream, and in specific the memory suffix uses a different type of calculation. As we convert our autoscaler implementation to use the upstream annotations we must make sure that any conversions will conform to upstream.
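The "ensure that both sets exist" behavior, including the care needed around suffixes and memory units, can be sketched as follows. This is a minimal illustration and not the operator's code: the suffix pairs (`vCPU`/`cpu`, `memoryMb`/`memory`), the MiB-to-quantity conversion, and all function names are assumptions made for the example. When both keys are already present, the sketch leaves them untouched.

```python
OPENSHIFT_PREFIX = "machine.openshift.io"
UPSTREAM_PREFIX = "capacity.cluster-autoscaler.kubernetes.io"


def mib_to_quantity(value: str) -> str:
    """Convert an integer MiB count into a Kubernetes-style quantity string."""
    return f"{int(value)}Mi"


def quantity_to_mib(value: str) -> str:
    """Convert a 'Mi' or 'Gi' quantity into an integer MiB count (other units unsupported)."""
    for suffix, factor in (("Gi", 1024), ("Mi", 1)):
        if value.endswith(suffix):
            return str(int(value[: -len(suffix)]) * factor)
    raise ValueError(f"unsupported memory quantity: {value!r}")


def same(value: str) -> str:
    """Pass a value through unchanged."""
    return value


# (OpenShift suffix, upstream suffix, to-upstream converter, to-OpenShift converter).
# These suffix names are assumptions for illustration only.
PAIRS = [
    ("vCPU", "cpu", same, same),
    ("memoryMb", "memory", mib_to_quantity, quantity_to_mib),
]


def sync_annotations(annotations: dict) -> dict:
    """Return a copy of the annotations in which each known pair has both keys set."""
    result = dict(annotations)
    for os_suffix, up_suffix, to_upstream, to_openshift in PAIRS:
        os_key = f"{OPENSHIFT_PREFIX}/{os_suffix}"
        up_key = f"{UPSTREAM_PREFIX}/{up_suffix}"
        if os_key in result and up_key not in result:
            result[up_key] = to_upstream(result[os_key])
        elif up_key in result and os_key not in result:
            result[os_key] = to_openshift(result[up_key])
    return result
```

Keeping the converters next to the key pairs makes the unit difference explicit, which is the part of the migration most likely to go wrong silently.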
<Describes the context or background related to this story>
As a developer, in order to deprecate the old annotations, we will need to carry both for at least one release cycle. Updating the CAO to apply the upstream annotations, and the CAS to accept both (preferring upstream), will allow me to properly deprecate the old annotations.
During the process of making the CAO recognize the annotations, we need to enable it to modify the machineset to have the new annotation. Similarly, we want the autoscaler to recognize both sets of annotations in the short term while we switch.
As a developer I want to have a consistent way to apply the scale from zero annotations so that it is easier to update the various provider machineset actuators. Having a utility module in the MAO will make this easier by providing a single place for all the MachineSet actuators to share.
Currently the individual provider MachineSet actuators each contain string variables and independent implementations of the scale from zero annotations. This configuration is more brittle than having a central module which could be utilized by all the providers.
The goal of this initiative is to help boost adoption of OpenShift on ppc64le. This can be further broken down into several key objectives.
Epic Goal
Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing
Flag powervs-provider-id-fmt is being deprecated and removed in upstream via PR: https://github.com/kubernetes-sigs/cluster-api-provider-ibmcloud/pull/1404.
Need to make necessary changes to use flag provider-id-fmt.
This epic is another epic under the "reduce workload disruptions" umbrella.
This is now updated to get us most of the way to MCO-200 (Admin-Defined reboot & drain), but not necessarily with all the final features in place.
This epic aims to create a reboot/drain policy object and an MCO-management apparatus for initial functionality with MachineConfig-backed updates, with a restricted set of actions for the user. We also need a reboot/drain policy object for ImageContentSourcePolicy, ImageTagMirrorSet and ImageDigestMirrorSet to avoid drains/reboots when admins use these APIs and have other ways of ensuring image integrity.
This mostly focuses on the user interface for defining reboot/drain policies. We will also need this for the layering "live apply" cases and bifrost-backed updates, to be implemented into a future update.
The MCO's reboot and drain rules are currently hard-coded in the machine-config-daemon here.
According to RFE-3667, node drains also occur even beyond OCP 4.9, not just when adding but also when removing ICSP, ITMS, or IDMS objects, or single mirroring rules in their configuration.
This causes at least three problems:
Done when:
Description of problem:
The MCO logic today allows users to not reboot when changing the registries.conf file (through ICSP/IDMS/ITMS objects), but the MCO will sometimes drain the node if the change is deemed "unsafe" (deleting a mirror, for example). This behaviour is very disruptive for some customers who wish to make all image registry changes non-disruptive. We will address this long term with admin-defined policies via the API, but we would like to have a backportable solution (as a support exception) for users to do so.
Version-Release number of selected component (if applicable):
4.14->4.16
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Have the MCC validate the correctness of user-provided spec, and render the final object into the status for the daemon to use
Image and artifact signing is a key part of a DevSecOps model. The Red Hat-sponsored sigstore project aims to simplify signing of cloud-native artifacts and sees increasing interest and uptake in the Kubernetes community. This document proposes to incrementally invest in OpenShift support for sigstore-style signed images and be public about it. The goal is to give customers a practical and scalable way to establish content trust. It will strengthen OpenShift’s security philosophy and value-add in the light of the recent supply chain security crisis.
CRIO
https://docs.google.com/document/d/12ttMgYdM6A7-IAPTza59-y2ryVG-UUHt-LYvLw4Xmq8/edit#
Place holder epic to capture all azure tickets.
TODO: review.
Description of problem:
The cloud network config controller never initializes on Azure HostedClusters. This behavior exists on both Arm and x86 Azure mgmt clusters.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create either an Arm or x86 Azure mgmt cluster
2. Install HO
3. Create a HostedCluster
4. Observe CNCC pod doesn't initialize
Actual results:
CNCC pod doesn't initialize
Expected results:
CNCC pod initializes
Additional info:
It looks like a secret isn't being reconciled to the CPO:

    % oc describe pod/cloud-network-config-controller-96567b45f-7jkl5 -n clusters-brcox-hypershift-arm
    Name: cloud-network-config-controller-96567b45f-7jkl5
    ...
    Events:
      Type  Reason  Age  From  Message
      ----  ------  ----  ----  -------
      Normal  Scheduled  5m46s  default-scheduler  Successfully assigned clusters-brcox-hypershift-arm/cloud-network-config-controller-96567b45f-7jkl5 to ci-ln-vmb5w8k-1d09d-jr6m6-worker-centralus3-45dg5
      Warning  FailedMount  101s (x2 over 3m44s)  kubelet  Unable to attach or mount volumes: unmounted volumes=[cloud-provider-secret], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition
      Warning  FailedMount  97s (x10 over 5m47s)  kubelet  MountVolume.SetUp failed for volume "cloud-provider-secret" : secret "cloud-network-config-controller-creds" not found
As of OpenShift 4.14, this functionality is Tech Preview for all platforms but OpenStack, where it is GA. This Feature is to bring the functionality to GA for all remaining platforms.
Allow configuring control plane nodes across multiple subnets for on-premise IPI deployments. With nodes separated into subnets, also allow using an external load balancer instead of the built-in one (keepalived/haproxy) that the IPI workflow installs, so that customers can configure their own load balancer with the ingress and API VIPs pointing to nodes in the separate subnets.
I want to install OpenShift with IPI on an on-premise platform (high priority for bare metal and vSphere) and I need to distribute my control plane and nodes across multiple subnets.
I want to use IPI automation but I will configure an external load balancer for the API and Ingress VIPs, instead of using the built-in keepalived/haproxy-based load balancer that comes with the on-prem platforms.
Customers require using multiple logical availability zones to define their architecture and topology for their datacenter. OpenShift clusters are expected to fit in this architecture for the high availability and disaster recovery plans of their datacenters.
Customers want the benefits of IPI and automated installations (and avoid UPI) and at the same time when they expect high traffic in their workloads they will design their clusters with external load balancers that will have the VIPs of the OpenShift clusters.
Load balancers can distribute incoming traffic across multiple subnets, which is something our built-in load balancers aren't able to do and which represents a big limitation for the topologies customers are designing.
While this is possible with IPI AWS, this isn't available with on-premise platforms installed with IPI (for the control plane nodes specifically), and customers see this as a gap in OpenShift for on-premise platforms.
Epic | Control Plane with Multiple Subnets | Compute with Multiple Subnets | Doesn't need external LB | Built-in LB |
---|---|---|---|---|
NE-1069 (all-platforms) | ✓ | ✓ | ✓ | ✓ |
NE-905 (all-platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ✓ | |
✓ | ✓ | ✓ | ||
✓ | ✓ | ✓ | ✕ | |
NE-905 (all platforms) | ✓ | ✓ | ✓ | ✕ |
✓ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ | |
✕ | ✓ | ✓ | ✓ |
Workers on separate subnets with IPI documentation
We can already deploy compute nodes on separate subnets by preventing the built-in LBs from running on the compute nodes. This is documented for bare metal only for the Remote Worker Nodes use case: https://docs.openshift.com/container-platform/4.11/installing/installing_bare_metal_ipi/ipi-install-installation-workflow.html#configure-network-components-to-run-on-the-control-plane_ipi-install-installation-workflow
This procedure works on vSphere too, although it has no QE CI coverage and is not documented.
External load balancer with IPI documentation
Currently o/installer validates that ELB is used only in TechPreview clusters. This validation needs to be removed so that ELB can be consumed from GA.
Upstream K8s deprecated PodSecurityPolicy and replaced it with a new built-in admission controller that enforces the Pod Security Standards (see here for the motivations for deprecation). There is an OpenShift-specific dedicated pod admission system called Security Context Constraints. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission.
With OpenShift 4.11, we turned on the Pod Security Admission with global "privileged" enforcement. Additionally, we set the "restricted" profile for warnings and audit. This configuration made it possible for users to opt their namespaces in to Pod Security Admission with the per-namespace labels. We also introduced a new mechanism that automatically synchronizes the Pod Security Admission "warn" and "audit" labels.
With OpenShift 4.15, we intend to move the global configuration to enforce the "restricted" pod security profile globally. With this change, the label synchronization mechanism will also switch into a mode where it synchronizes the "enforce" Pod Security Admission label rather than the "audit" and "warn".
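The change in label synchronization mode described above can be sketched as follows. This is a minimal illustration rather than the controller's actual code: the function name and the `enforce_mode` flag are hypothetical, while the `pod-security.kubernetes.io/*` keys are the standard Pod Security Admission namespace labels.

```python
PSA_PREFIX = "pod-security.kubernetes.io"


def labels_to_sync(profile: str, enforce_mode: bool) -> dict:
    """Return the namespace labels a synchronizer would set for a given profile.

    enforce_mode=False models the earlier behavior (sync "warn" and "audit");
    enforce_mode=True models the planned behavior (sync "enforce" instead).
    """
    modes = ("enforce",) if enforce_mode else ("warn", "audit")
    return {f"{PSA_PREFIX}/{mode}": profile for mode in modes}
```

For example, `labels_to_sync("restricted", enforce_mode=True)` yields only the `enforce` label, which is the mode switch the paragraph describes.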
Epic Goal
Get Pod Security admission to run in "restricted" mode globally by default, alongside SCC admission.
When creating a custom SCC, it is possible to assign a priority that is higher than existing SCCs. This means that any SA with access to all SCCs might use the higher priority custom SCC, and this might mutate a workload in an unexpected/unintended way.
To protect platform workloads from such an effect (which, combined with PSa, might result in rejecting the workload once we start enforcing the "restricted" profile) we must pin the required SCC to all workloads in platform namespaces (openshift-, kube-, default).
Each workload should pin the least-privileged SCC, except workloads in runlevel 0 namespaces, which should pin the "privileged" SCC (SCC admission is not enabled in these namespaces, but we should pin an SCC for tracking purposes).
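The pinning rule can be sketched as a small helper. This is an illustration under stated assumptions: the text does not name the pinning mechanism, so the `openshift.io/required-scc` pod annotation is assumed here, and the function names and the `runlevel_zero` flag are hypothetical.

```python
import copy

# Assumed pinning mechanism: a pod-template annotation naming the required SCC.
REQUIRED_SCC_ANNOTATION = "openshift.io/required-scc"


def is_platform_namespace(namespace: str) -> bool:
    """Platform namespaces per the text: openshift-*, kube-*, and default."""
    return namespace == "default" or namespace.startswith(("openshift-", "kube-"))


def pin_scc(pod_template: dict, namespace: str, least_privileged_scc: str,
            runlevel_zero: bool = False) -> dict:
    """Return a copy of the pod template with the SCC pinned for platform namespaces."""
    pinned = copy.deepcopy(pod_template)
    if not is_platform_namespace(namespace):
        return pinned  # non-platform workloads are left untouched
    # Runlevel 0 namespaces pin "privileged" for tracking purposes only.
    scc = "privileged" if runlevel_zero else least_privileged_scc
    annotations = pinned.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[REQUIRED_SCC_ANNOTATION] = scc
    return pinned
```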
The following table tracks progress:
namespace | in review | merged |
---|---|---|
openshift-apiserver-operator | PR | |
openshift-authentication | PR | |
openshift-authentication-operator | PR | |
openshift-catalogd | PR | |
openshift-cloud-controller-manager | ||
openshift-cloud-controller-manager-operator | ||
openshift-cloud-credential-operator | PR | |
openshift-cloud-network-config-controller | PR | |
openshift-cluster-csi-drivers | PR1, PR2 | |
openshift-cluster-machine-approver | ||
openshift-cluster-node-tuning-operator | PR | |
openshift-cluster-olm-operator | PR | |
openshift-cluster-samples-operator | PR | |
openshift-cluster-storage-operator | PR1, PR2 | |
openshift-cluster-version | PR | |
openshift-config-operator | PR | |
openshift-console | PR | |
openshift-console-operator | PR | |
openshift-controller-manager | PR | |
openshift-controller-manager-operator | PR | |
openshift-dns | ||
openshift-dns-operator | ||
openshift-etcd | ||
openshift-etcd-operator | ||
openshift-image-registry | PR | |
openshift-ingress | PR | |
openshift-ingress-canary | PR | |
openshift-ingress-operator | PR | |
openshift-insights | PR | |
openshift-kube-apiserver | ||
openshift-kube-apiserver-operator | ||
openshift-kube-controller-manager | ||
openshift-kube-controller-manager-operator | ||
openshift-kube-scheduler | ||
openshift-kube-scheduler-operator | ||
openshift-kube-storage-version-migrator | PR | |
openshift-kube-storage-version-migrator-operator | PR | |
openshift-machine-api | PR1, PR2, PR3, PR4, PR5, PR6 | |
openshift-machine-config-operator | PR | |
openshift-marketplace | PR | |
openshift-monitoring | PR | |
openshift-multus | ||
openshift-network-diagnostics | PR | |
openshift-network-node-identity | PR | |
openshift-network-operator | ||
openshift-oauth-apiserver | PR | |
openshift-operator-controller | PR | |
openshift-operator-lifecycle-manager | PR | |
openshift-ovn-kubernetes | ||
openshift-route-controller-manager | PR | |
openshift-service-ca | PR | |
openshift-service-ca-operator | PR | |
openshift-user-workload-monitoring | PR |
To be broken into one feature epic and a spike:
The MCO today has multiple layers of errors. There are generally speaking 4 locations where an error message can appear, from highest to lowest:
The error propagation is generally speaking not 1-to-1. The operator status will generally capture the pool status, but the full error from Controller/Daemon does not fully bubble up to pool/operator, and the journal logs with error generally don’t get bubbled up at all. This is very confusing for customers/admins working with the MCO without full understanding of the MCO’s internal mechanics:
Using “unexpected on-disk state” as an example, this can be caused by any number of the following:
Etc. etc.
Since error use cases are wide and varied, there are many improvements we can perform for each individual error state. This epic aims to propose targeted improvements to error messaging and propagation specifically. The goals being:
With a side objective of observability, including reporting all the way to the operator status items such as:
Approaches can include:
It became clear over time that we need to enhance most of the existing MCO metrics and add more related to the MCC. The MCC is tasked with watching what's going on with pools, so it makes sense to add more metrics and alerting there especially. We have hit, and are still hitting, various hiccups with metrics. This epic aims to address those and to start adding more useful metrics and alerting to the MCO. Another aim for this epic (which we can split out) is to provide more data to help us proactively debug clusters when things go wrong.
After spiking, the work for metric enhancement is split in the following way:
Add the following types of metrics in the proper places in the MCO:
This will involve registering new metrics and making sure that they are updated when key events occur.
Description of problem:
When querying for registered metrics in Console/Observe/Metrics, certain metrics do not show up (the query returns "No datapoints found"). These metrics include mcc_drain_err, mcc_state, etc.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Start up a cluster
2. Run a query, e.g. mcc_drain_err, in Console/Observe/Metrics
Actual results:
No datapoints found
Expected results:
Numerical result
Additional info:
This only happens to metrics that are defined / registered with the type gaugeVec. It was discovered that any vec (gaugevec, countervec) needs to be initialized; otherwise, it will not show up until it is updated: https://github.com/thought-machine/please-servers/pull/258
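The behavior can be illustrated with a toy model of a labelled metric vector. This is not the Prometheus client API: `ToyGaugeVec`, `with_label_values`, and the `node` label are hypothetical names chosen for the sketch. The point it shows is that a vector exposes no samples until at least one label combination has been created, so initializing a child at registration time makes the series visible immediately.

```python
class ToyGauge:
    """A single labelled series holding one float value."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class ToyGaugeVec:
    """Toy stand-in for a labelled gauge vector (not a real client library)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._children = {}  # label values tuple -> ToyGauge

    def with_label_values(self, *values):
        """Return the child series for these label values, creating it on first use."""
        if len(values) != len(self.label_names):
            raise ValueError("wrong number of label values")
        return self._children.setdefault(tuple(values), ToyGauge())

    def collect(self):
        """Return (name, labels, value) for every child that exists so far."""
        return [
            (self.name, dict(zip(self.label_names, values)), child.value)
            for values, child in self._children.items()
        ]
```

With this model, a freshly registered vector collects nothing ("No datapoints found"), while calling `with_label_values("").set(0)` once at startup produces a numerical result.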
Overarching Goal
Move to using the upstream Cluster API (CAPI) in place of the current implementation of the Machine API for standalone Openshift.
Phases 1 & 2 cover implementing base functionality for CAPI.
Phase 2 also covers migrating MAPI resources to CAPI.
Phase 2 Goal:
for Phase-1, incorporating the assets from different repositories to simplify asset management.
There must be no negative effect on customers/users of the MAPI. This API must continue to be accessible to them, though how it is implemented "under the covers", and whether that implementation leverages CAPI, is open.
Description of problem:
I would like to use OpenShift clusters as management clusters to create other guest clusters. This is not currently possible because of restrictions imposed by the validating webhook for the Cluster object which prevents Cluster objects being deleted. The webhook _should_ only apply to the `openshift-cluster-api` namespace, but presently will apply to all namespaces.
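The intended scoping can be sketched as a namespace check at the start of the delete validation. This is a minimal illustration, not the webhook's actual code: the function name, return shape, and error message are hypothetical.

```python
from typing import Tuple

MANAGED_NAMESPACE = "openshift-cluster-api"


def validate_cluster_delete(namespace: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for a request to delete a Cluster object.

    Only Cluster objects in the managed namespace are protected; objects in
    any other namespace (for example, guest clusters) may be deleted.
    """
    if namespace != MANAGED_NAMESPACE:
        return True, ""
    return False, f"deleting Cluster objects in {MANAGED_NAMESPACE} is not allowed"
```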
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create a Cluster object in a namespace that isn't openshift-cluster-api
2. Attempt to delete the Cluster object you just created
Actual results:
Error, cannot delete Cluster
Expected results:
Cluster object deleted successfully
Additional info:
Description of problem:
Cluster API expects the user data secret created by the installer to contain a `format: ignition` value. This is used by the various providers to identify that they should treat the user data as ignition. We do not currently set this value, but should. This may cause issues with certain providers that expect ignition to be uploaded to blob storage; we should identify the behaviour of the providers we care about (AWS, Azure, GCP, vSphere) and ensure that they behave well and continue to work with this change. (This may mean upstream changes to only upload large ignitions to the storage or to make the storage optional.)
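A user data secret carrying the `format: ignition` value could look like the manifest built below. This is a sketch under assumptions: the helper names are hypothetical, and storing the payload under a `value` key is an assumption that is not stated in the text. Secret `data` values are base64-encoded, as Kubernetes requires.

```python
import base64


def b64(text: str) -> str:
    """Base64-encode a string the way Kubernetes Secret 'data' fields expect."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def user_data_secret(name: str, namespace: str, ignition_config: str) -> dict:
    """Build a Secret manifest that marks its payload as ignition."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "data": {
            "value": b64(ignition_config),  # assumed key for the payload
            "format": b64("ignition"),      # tells providers how to treat it
        },
    }
```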
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As an OpenShift on vSphere administrator, I want to specify static IP assignments to my VMs.
As an OpenShift on vSphere administrator, I want to completely avoid using a DHCP server for the VMs of my OpenShift cluster.
Customers want the convenience of IPI deployments for vSphere without having to use DHCP. As in bare metal, where METAL-1 added this capability, some of the reasons are the security implications of DHCP (customers report that, for example, depending on configuration, they allow any device to get in the network). At the same time, IPI deployments only require our OpenShift installation software, while with UPI they would need automation software that, in secure environments, they would have to certify along with OpenShift.
Bare metal related work:
CoreOS Afterburn:
https://github.com/coreos/afterburn/blob/main/src/providers/vmware/amd64.rs#L28
https://github.com/openshift/installer/blob/master/upi/vsphere/vm/main.tf#L34
USER STORY:
As a cluster admin, I find the status field relating to IPAddressClaimed a bit confusing; it should be improved to make it easier to understand.
DESCRIPTION:
Currently the machine object has the following status:

    providerStatus:
      conditions:
      - lastTransitionTime: "2023-09-04T17:50:34Z"
        message: All IP address claims are bound
        reason: WaitingForIPAddress
        status: "False"
        type: IPAddressClaimed
The reason, status and message are a bit confusing when IP address claim is bound. The above is an example of what it says when it is finished.
ACCEPTANCE CRITERIA:
The status should look something like the following when the IP is claimed:

    conditions:
    - lastTransitionTime: "2023-09-06T13:52:51Z"
      message: All IP address claims are bound
      reason: IPAddressesClaimed
      status: "True"
      type: IPAddressClaimed

The reason text may change to match the formatting of other condition fields.
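The desired condition logic can be sketched as follows. This is an illustration only: the function name and the wording of the "not yet bound" message are hypothetical, while the `type`, `reason`, `status`, and bound-state `message` values are taken from the examples in this story.

```python
def ip_address_claimed_condition(claims_bound: list, now: str) -> dict:
    """Build the IPAddressClaimed condition from a list of per-claim bound flags."""
    if all(claims_bound):
        # Every claim is bound: status and reason now agree with the message.
        return {
            "type": "IPAddressClaimed",
            "status": "True",
            "reason": "IPAddressesClaimed",
            "message": "All IP address claims are bound",
            "lastTransitionTime": now,
        }
    pending = claims_bound.count(False)
    return {
        "type": "IPAddressClaimed",
        "status": "False",
        "reason": "WaitingForIPAddress",
        # Hypothetical message text for the waiting state.
        "message": f"{pending} of {len(claims_bound)} IP address claims are not yet bound",
        "lastTransitionTime": now,
    }
```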
ENGINEERING DETAILS:
This most likely will involve updating a few different projects:
USER STORY:
As a system admin, I would like the static IP support for vSphere to use IPAddressClaims to provide IP addresses during installation so that after the install, the machines are defined in a way that is intended for use with IPAM controllers.
DESCRIPTION:
Currently the installer for vSphere will directly set the static IPs into the machine object yaml files. We would like to enhance the installer to create an IPAddress and an IPAddressClaim for each machine, as well as update the machinesets to use addressesFromPools to request the IPAddress. Also, we should create a custom CRD that is the basis for the pool defined in the addressesFromPools field.
ACCEPTANCE CRITERIA:
After installing static IP for vSphere IPI, the cluster should contain machines, machinesets, crd, ipaddresses and ipaddressclaims related to static IP assignment.
ENGINEERING DETAILS:
These changes should all be contained in the installer project. We will need to be sure to cover static IP for zonal and non-zonal installs. Additionally, we need to have this work for all control-plane and compute machines.
Done when:
We have an enhancement drafted and socialized that
Should be reviewed by/contain provisions for
Older clusters updating into or running 4.15.0-rc.0 (and possibly Engineering Candidates?) can have the Kube API server operator initiate certificate rollouts, including the api-int CA. Missing pieces in the pipeline to roll out the new CA to kubelets and other consumers lead the cluster to lock up when the Kubernetes API servers transition to using the new cert/CA pair when serving incoming requests. For example, nodes may go NotReady with kubelets unable to call in their status to an api-int signed by the new CA that they don't yet trust.
Seen in two updates from 4.14.6 to 4.15.0-rc.0. Unclear if Engineering Candidates were also exposed. 4.15.0-rc.1 and later will not be exposed because they have the fix for OCPBUGS-18761. They may still have broken logic for these CA rotations in place, but until the certs are 8y or more old, they will not trigger that broken logic.
We're working on it. Maybe cluster-kube-apiserver-operator#1615.
Nodes go NotReady with kubelet failing to communicate with api-int because of tls: failed to verify certificate: x509: certificate signed by unknown authority.
Happy certificate rollout.
Rolling the api-int CA is complicated, and we seem to be missing a number of steps. It's probably worth working out details in a GDoc or something where we have a shared space to fill out the picture.
One piece is getting the api-int certificates out to the kubelet, where the flow seems to be:
That handles new-node creation, but not "Kube API-server operator rolled the CA, and now we need to update existing nodes, and systemctl status restart their kubelets. And any pods using ServiceAccount kubeconfigs? And...?". This bug is about filling in those missing pieces in the cert-rolling pipeline (including having the Kube API server not use the new CA until it has been sufficiently rolled out to api-int clients, possibly including every ServiceAccount-consuming pod on the cluster?), and anything else that seems broken with the early cert-rolls.
Somewhat relevant here is OCPBUGS-15367 currently managing /etc/kubernetes/kubeconfig permissions in the machine-config daemon to backstop for the file existing in the MCS-served Ignition config but not being a part of the rendered MachineConfig or the ControllerConfig stack.
Add authorization to the internal components of the Agent Installer so that the cluster install is secure.
Requirements
Are there any requirements specific to the auth token?
Actors:
Do we need more than one auth scheme?
Agent-admin - agent-read-write
Agent-user - agent-read
Options for Implementation:
As a user, when running agent create image, agent create pxe-files and agent create config iso commands, I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This is part of the overall multi-release Composable OpenShift effort (OCPPLAN-9638), which is being delivered in multiple phases:
Phase 1 (OpenShift 4.11): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 2 (OpenShift 4.12): OCPPLAN-7589 Provide a way with CVO to allow disabling and enabling of operators
Phase 3 (OpenShift 4.13): OCPBU-117
Phase 4 (OpenShift 4.14): OCPSTRAT-36 (formerly OCPBU-236)
Phase 5 (OpenShift 4.15): OCPSTRAT-421 (formerly OCPBU-519)
Phase 6 (OpenShift 4.16): OCPSTRAT-731
Phase 7 (OpenShift 4.17): OCPSTRAT-1308
Questions to be addressed:
In the initial delivery of CoreOS Layering, it is required that administrators provide their own build environment to customize RHCOS images. That could be a traditional RHEL environment or potentially an enterprising administrator with some knowledge of OCP Builds could set theirs up on-cluster.
The primary virtue of an on-cluster build path is to continue using the cluster to manage the cluster. No external dependency, batteries-included.
On-cluster, automated RHCOS Layering builds are important for multiple reasons:
This work describes the tech preview state of On Cluster Builds. Major interfaces should be agreed upon at the end of this state.
Description of problem:
In a cluster with a pool using OCB functionality, if we update the imageBuilderType value while an openshift-image-builder pod is building an image, the build fails. It can fail in 2 ways:
1. The running pod that is building the image is removed, and we get a failed build reporting "Error (BuildPodDeleted)".
2. The machine-os-builder pod is restarted but the build pod is not removed. Then the build is never removed.
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME  VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
    version  4.14.0-0.nightly-2023-09-12-195514  True  False  154m  Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Steps to Reproduce:
1. Create the needed resources to make OCB functionality work (on-cluster-build-config configmap, the secrets and the imageSpec). We reproduced it using imageBuilderType="":

    oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'

2. Create an infra pool and label it so that it can use OCB functionality:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: infra
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/infra: ""

    oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=

3. Wait until the triggered build has finished.
4. Create a new MC to trigger a new build. This one, for example:

    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: test-machine-config
    spec:
      config:
        ignition:
          version: 3.1.0
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,dGVzdA==
            filesystem: root
            mode: 420
            path: /etc/test-file.test

5. Just after a new build pod is created, configure the on-cluster-build-config configmap to use the "custom-pod-builder" imageBuilderType:

    oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'
Actual results:
We have observed 2 behaviors after step 5:

1. The machine-os-builder pod is restarted and the build is never removed.

    build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855  Docker  Dockerfile  Running  10 seconds ago

    NAME  READY  STATUS  RESTARTS  AGE
    pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build  1/1  Running  0  12s
    pod/machine-config-controller-5bdd7b66c5-dl4hh  2/2  Running  0  90m
    pod/machine-config-daemon-5wbw4  2/2  Running  0  90m
    pod/machine-config-daemon-fqr8x  2/2  Running  0  90m
    pod/machine-config-daemon-g77zd  2/2  Running  0  83m
    pod/machine-config-daemon-qzmvv  2/2  Running  0  83m
    pod/machine-config-daemon-w8mnz  2/2  Running  0  90m
    pod/machine-config-operator-7dd564556d-mqc5w  2/2  Running  0  92m
    pod/machine-config-server-28lnp  1/1  Running  0  89m
    pod/machine-config-server-5csjz  1/1  Running  0  89m
    pod/machine-config-server-fv4vk  1/1  Running  0  89m
    pod/machine-os-builder-6cfbd8d5d-2f7kd  0/1  Terminating  0  3m26s
    pod/machine-os-builder-6cfbd8d5d-h2ltd  0/1  ContainerCreating  0  1s

    NAME  TYPE  FROM  STATUS  STARTED  DURATION
    build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855  Docker  Dockerfile  Running  12 seconds ago

2. The build pod is removed and the build fails with Error (BuildPodDeleted):

    NAME  TYPE  FROM  STATUS  STARTED  DURATION
    build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855  Docker  Dockerfile  Running  10 seconds ago

    NAME  READY  STATUS  RESTARTS  AGE
    pod/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855-build  1/1  Terminating  0  12s
    pod/machine-config-controller-5bdd7b66c5-dl4hh  2/2  Running  0  159m
    pod/machine-config-daemon-5wbw4  2/2  Running  0  159m
    pod/machine-config-daemon-fqr8x  2/2  Running  0  159m
    pod/machine-config-daemon-g77zd  2/2  Running  8  152m
    pod/machine-config-daemon-qzmvv  2/2  Running  16  152m
    pod/machine-config-daemon-w8mnz  2/2  Running  0  159m
    pod/machine-config-operator-7dd564556d-mqc5w  2/2  Running  0  161m
    pod/machine-config-server-28lnp  1/1  Running  0  159m
    pod/machine-config-server-5csjz  1/1  Running  0  159m
    pod/machine-config-server-fv4vk  1/1  Running  0  159m
    pod/machine-os-builder-6cfbd8d5d-g62b6  1/1  Running  0  2m11s

    NAME  TYPE  FROM  STATUS  STARTED  DURATION
    build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855  Docker  Dockerfile  Running  12 seconds ago
    .....
    NAME  TYPE  FROM  STATUS  STARTED  DURATION
    build.build.openshift.io/build-rendered-infra-b2473d404d9ddfa1536d2fb32b54d855  Docker  Dockerfile  Error (BuildPodDeleted)  17 seconds ago  13s
Expected results:
Updating the imageBuilderType while a build is running should not leave the OCB functionality in a broken state.
Additional info:
Must-gather files are provided in the first comment in this ticket.
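For readers checking what the test MachineConfig in step 4 actually writes, the file `source` is an RFC 2397 data URL and `mode: 420` is a decimal value. The sketch below decodes both. The helper name `decode_data_url` is hypothetical and is not part of the MCO or Ignition; the percent-encoded sample URL is also made up for illustration.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_url(source: str) -> bytes:
    """Decode an RFC 2397 data URL (hypothetical helper, for illustration only)."""
    if not source.startswith("data:"):
        raise ValueError("not a data URL")
    header, sep, payload = source[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URL has no comma separator")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    # Without ';base64' the payload is percent-encoded text.
    return unquote_to_bytes(payload)


# The source used by the MachineConfig in step 4.
print(decode_data_url("data:text/plain;charset=utf-8;base64,dGVzdA=="))  # b'test'
# A made-up percent-encoded example.
print(decode_data_url("data:,hello%0A"))  # b'hello\n'
# Ignition file modes are decimal: 420 is octal 0644 (rw-r--r--).
print(oct(420))  # 0o644
```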
There are a few situations in which a cluster admin might want to trigger a rebuild of their OS image in addition to situations where cluster state may dictate that we should perform a rebuild. For example, if the custom Dockerfile changes or the machine-config-osimageurl changes, it would be desirable to perform a rebuild in that case. To that end, this particular story covers adding the foundation for a rebuild mechanism in the form of an annotation that can be applied to the target MachineConfigPool. What is out of scope for this story is applying this annotation in response to a change in cluster state (e.g., custom Dockerfile change).
Done When:
Only start the buildcontroller if the tech preview feature gate is enabled.
Description of problem:
In pools with On-Cluster Build enabled, when a config drift happens because a file's content has been manually changed, the MCP goes degraded (this is expected):

    - lastTransitionTime: "2023-08-31T11:34:33Z"
      message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting: "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................: content mismatch for file \"/etc/mco-test-file\""'
      reason: 1 nodes are reporting degraded status on sync
      status: "True"
      type: NodeDegraded

If we fix this drift and restore the original file's content, the MCP becomes degraded with this message:

    - lastTransitionTime: "2023-08-31T12:24:47Z"
      message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is reporting: "failed to update OS to quay.io/xxx/xxx@sha256:....... : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........: error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n: exit status 1"'
      reason: 1 nodes are reporting degraded status on sync
      status: "True"
      type: NodeDegraded
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.0-0.nightly-2023-08-30-191617   True        False         4h18m   Error while reconciling 4.14.0-0.nightly-2023-08-30-191617: the cluster operator monitoring is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable the OCB functionality for the worker pool:

    $ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

   (Create the necessary cms and secrets for the OCB functionality to work fine.) Wait until the new image is created and the nodes are updated.

2. Create a MC to deploy a new file:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: worker
      name: mco-drift-test-file
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
            - contents:
                source: data:,MCO%20test%20file%0A
              path: /etc/mco-test-file

   Wait until the new MC is deployed.

3. Modify the content of the file /etc/mco-test-file, making a backup first:

    $ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}")
    Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
    Starting pod/sregidor-sr2-2gb5z-worker-a-q7wcbcopenshift-qeinternal-debug-sv85v ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.128.9
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-5.1# cd /etc
    sh-5.1# cat mco-test-file
    MCO test file
    sh-5.1# cp mco-test-file mco-test-file-back
    sh-5.1# echo -n "1" >> mco-test-file

4. Wait until the MCP reports the config drift issue:

    $ oc get mcp worker -o yaml
    ....
    - lastTransitionTime: "2023-08-31T11:34:33Z"
      message: 'Node sregidor-sr2-2gb5z-worker-a-7tpjd.c.openshift-qe.internal is reporting: "unexpected on-disk state validating against quay.io/xxx/xxx@sha256:........................: content mismatch for file \"/etc/mco-test-file\""'
      reason: 1 nodes are reporting degraded status on sync
      status: "True"
      type: NodeDegraded

5. Restore the backup that we made in step 3:

    sh-5.1# cp mco-test-file-back mco-test-file
Actual results:
The worker pool is degraded with this message:

    - lastTransitionTime: "2023-08-31T12:24:47Z"
      message: 'Node sregidor-sr2-2gb5z-worker-a-q7wcb.c.openshift-qe.internal is reporting: "failed to update OS to quay.io/xxx/xxx@sha256:....... : error running rpm-ostree rebase --experimental ostree-unverified-registry:quay.io/xxx/xxx@sha256:........: error: Old and new refs are equal: ostree-unverified-registry:quay.io/xxx/xxx@sha256:..............\n: exit status 1"'
      reason: 1 nodes are reporting degraded status on sync
      status: "True"
      type: NodeDegraded
Expected results:
The node pool should stop being degraded.
Additional info:
There is a link to the must-gather file in the first comment of this issue.
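The error above comes from asking rpm-ostree to rebase onto the ref the node is already booted from. One possible way to avoid it, shown here only as a sketch and not as the fix the MCO actually shipped, is to skip the rebase when the booted ref already equals the target. The function name `plan_os_update` and the digest strings are hypothetical; the command line is the one quoted in the error message.

```python
from typing import List, Optional


def plan_os_update(booted_ref: str, target_ref: str) -> Optional[List[str]]:
    """Return the rebase command to run, or None when no rebase is needed.

    Hypothetical helper: it only illustrates guarding against
    "Old and new refs are equal".
    """
    if booted_ref == target_ref:
        # The node is already on the desired image; only the drifted
        # files need to be revalidated, not the OS.
        return None
    return ["rpm-ostree", "rebase", "--experimental", target_ref]


same = "ostree-unverified-registry:quay.io/xxx/xxx@sha256:aaaa"
other = "ostree-unverified-registry:quay.io/xxx/xxx@sha256:bbbb"
print(plan_os_update(same, same))   # None
print(plan_os_update(same, other))
```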
Description of problem:
In clusters with OCB functionality enabled, sometimes the machine-os-builder pod is not restarted when we update the imageBuilderType. What we have observed is that the pod is restarted if a build is running, but it is not restarted if we are not building anything.
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.0-0.nightly-2023-09-12-195514   True        False         88m     Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create the configuration resources needed by the OCB functionality. To reproduce this issue we use an on-cluster-build-config configmap with an empty imageBuilderType:

    oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": ""}}'

2. Create an infra pool and label it so that it can use OCB functionality:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: infra
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/infra: ""

    oc label mcp/infra machineconfiguration.openshift.io/layering-enabled=

3. Wait for the build pod to finish.

4. Once the build has finished and it has been cleaned up, update the imageBuilderType so that we now use the "custom-pod-builder" type:

    oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "custom-pod-builder"}}'
Actual results:
We waited for one hour, but the pod is never restarted.

    $ oc get pods |grep build
    machine-os-builder-6cfbd8d5d-xk6c5   1/1   Running   0   56m

    $ oc logs machine-os-builder-6cfbd8d5d-xk6c5 |grep Type
    I0914 08:40:23.910337 1 helpers.go:330] imageBuilderType empty, defaulting to "openshift-image-builder"

    $ oc get cm on-cluster-build-config -o yaml |grep Type
      imageBuilderType: custom-pod-builder
Expected results:
When we update the imageBuilderType value, the machine-os-builder pod should be restarted.
Additional info:
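The expected behavior can be stated as a small decision rule: the restart should depend only on whether the effective builder type changed, not on whether a build happens to be running. The sketch below assumes the defaulting shown in the pod log (an empty value means "openshift-image-builder"); the function names are hypothetical and this is not the machine-os-builder code.

```python
DEFAULT_BUILDER_TYPE = "openshift-image-builder"  # default reported in the pod log


def effective_builder_type(configmap_data: dict) -> str:
    """Resolve the builder type, treating a missing or empty value as the default."""
    value = (configmap_data.get("imageBuilderType") or "").strip()
    return value or DEFAULT_BUILDER_TYPE


def needs_restart(running_type: str, configmap_data: dict) -> bool:
    """True when the pod runs with a different builder type than the configmap asks for."""
    return effective_builder_type(configmap_data) != running_type


# The pod started with an empty value, i.e. the default builder.
running = effective_builder_type({"imageBuilderType": ""})
print(needs_restart(running, {"imageBuilderType": ""}))                    # False
print(needs_restart(running, {"imageBuilderType": "custom-pod-builder"}))  # True
```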
Description of problem:
In OCB pools, when we create a MC to configure a password for the "core" user, the password is not configured.
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.0-0.nightly-2023-08-30-191617   True        False         5h38m   Cluster version is 4.14.0-0.nightly-2023-08-30-191617
How reproducible:
Always
Steps to Reproduce:
1. Enable on-cluster build on the "worker" pool.

2. Create a MC to configure the "core" user password:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      creationTimestamp: "2023-09-01T09:51:14Z"
      generation: 1
      labels:
        machineconfiguration.openshift.io/role: worker
      name: tc-59417-test-core-passwd-tx2ndvcd
      resourceVersion: "105610"
      uid: 1f7a4de1-6222-4153-a46c-d1a17e5f89b1
    spec:
      config:
        ignition:
          version: 3.2.0
        passwd:
          users:
            - name: core
              passwordHash: $6$uim4LuKWqiko1l5K$QJUwg.4lAyU4egsM7FNaNlSbuI6JfQCRufb99QuF082BpbqFoHP3WsWdZ5jCypS0veXWN1HDqO.bxUpE9aWYI1 # password coretest

3. Wait for the configuration to be built and applied.
Actual results:
The password is not configured for the core user.

In a worker node, we can't log in using the new password:

    $ oc debug node/sregidor-sr3-bfxxj-worker-a-h5b5j.c.openshift-qe.internal
    Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
    Starting pod/sregidor-sr3-bfxxj-worker-a-h5b5jcopenshift-qeinternal-debug-cb2gh ...
    To use host binaries, run `chroot /host`
    Pod IP: 10.0.128.2
    If you don't see a command prompt, try pressing enter.
    sh-4.4# chroot /host
    sh-5.1# su core
    [core@sregidor-sr3-bfxxj-worker-a-h5b5j /]$ su core
    Password:
    su: Authentication failure

The password is not configured:

    sh-5.1# cat /etc/shadow |grep core
    systemd-coredump:!!:::::::
    core:*:19597:0:99999:7:::
Expected results:
The password should be configured and we should be able to login to the nodes using the user "core" and the configured password.
Additional info:
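The `/etc/shadow` output above can be checked mechanically: the second colon-separated field holds the password hash, and a value such as `*` or one starting with `!` means no usable password. The sketch below is a hypothetical verification helper, not part of any test suite referenced here. It also matches the user name exactly, which avoids the `systemd-coredump` line that `grep core` picks up. The `$6$...` hash in the last call is a made-up placeholder.

```python
from typing import Optional


def shadow_entry(shadow_text: str, user: str) -> Optional[str]:
    """Return the shadow line for exactly this user, or None if absent."""
    for line in shadow_text.splitlines():
        if line.split(":", 1)[0] == user:
            return line
    return None


def password_is_set(shadow_line: str) -> bool:
    """True when the password field holds something that can be a real hash."""
    fields = shadow_line.split(":")
    if len(fields) < 2:
        raise ValueError("malformed shadow line")
    hashed = fields[1]
    # Empty, "*" and "!"-prefixed fields are not usable password hashes.
    return bool(hashed) and not hashed.startswith(("*", "!"))


observed = "systemd-coredump:!!:::::::\ncore:*:19597:0:99999:7:::"
entry = shadow_entry(observed, "core")
print(entry)                   # core:*:19597:0:99999:7:::
print(password_is_set(entry))  # False
print(password_is_set("core:$6$placeholder$placeholder:19597:0:99999:7:::"))  # True
```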
Description of problem:
In an on-cluster build pool, when we create a MC to update the sshkeys, we can't find the new keys in the nodes after the configuration is built and applied.
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.0-0.nightly-2023-08-30-191617   True        False         7h52m   Cluster version is 4.14.0-0.nightly-2023-08-30-191617
How reproducible:
Always
Steps to Reproduce:
1. Enable the on-cluster build functionality in the "worker" pool.

2. Check the value of the current keys:

    $ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat /home/core/.ssh/authorized_keys.d/ignition
    Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
    Starting pod/sregidor-sr3-bfxxj-worker-a-h5b5jcopenshift-qeinternal-debug-ljxgx ...
    To use host binaries, run `chroot /host`
    ssh-rsa AAAA..................................................................................................................................................................qe@redhat.com
    Removing debug pod ...

3. Create a new MC to configure the "core" user's sshkeys. We add 2 extra keys:

    $ oc get mc -o yaml tc-59426-add-ssh-key-9tv2owyp
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      creationTimestamp: "2023-09-01T10:57:14Z"
      generation: 1
      labels:
        machineconfiguration.openshift.io/role: worker
      name: tc-59426-add-ssh-key-9tv2owyp
      resourceVersion: "135885"
      uid: 3cf31fbb-7a4e-472d-8430-0c0eb49420fc
    spec:
      config:
        ignition:
          version: 3.2.0
        passwd:
          users:
            - name: core
              sshAuthorizedKeys:
                - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDPmGf/sfIYog...... mco_test@redhat.com
                - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDf....... mco_test2@redhat.com

4. Verify that the new rendered MC contains the 3 keys:

    $ oc get mcp worker
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
    worker   rendered-worker-02d04d7c47cd3e08f8f305541cf85000   True      False      False      2              2                   2                     0                      8h

    $ oc get mc -o yaml rendered-worker-02d04d7c47cd3e08f8f305541cf85000 | grep users -A9
    users:
    - name: core
      sshAuthorizedKeys:
      - ssh-rsa AAAAB...............................qe@redhat.com
      - ssh-rsa AAAAB...............................mco_test@redhat.com
      - ssh-rsa AAAAB...............................mco_test2@redhat.com
    storage:
Actual results:
Only the initial key is present in the node:

    $ oc debug node/$(oc get nodes -l node-role.kubernetes.io/worker -ojsonpath="{.items[0].metadata.name}") -- chroot /host cat /home/core/.ssh/authorized_keys.d/ignition
    Warning: metadata.name: this is used in the Pod's hostname, which can result in surprising behavior; a DNS label is recommended: [must be no more than 63 characters]
    Starting pod/sregidor-sr3-bfxxj-worker-a-h5b5jcopenshift-qeinternal-debug-ljxgx ...
    To use host binaries, run `chroot /host`
    ssh-rsa AAAA.........qe@redhat.com
    Removing debug pod ...
Expected results:
The added ssh keys should be configured in the /home/core/.ssh/authorized_keys.d/ignition file as well.
Additional info:
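The check performed by hand in this report (rendered MC keys versus the keys on the node) can be sketched as a set comparison. This is a hypothetical verification helper: it compares only the key type and key material, ignores the trailing comment, and assumes lines without an options prefix. The key strings below are short made-up placeholders.

```python
from typing import List, Tuple


def key_identity(line: str) -> Tuple[str, ...]:
    """Identify a key by its type and base64 body, ignoring the comment."""
    return tuple(line.split()[:2])


def missing_keys(expected: List[str], authorized_keys_text: str) -> List[str]:
    """Return the expected keys that are not present in the node's file."""
    present = {
        key_identity(line)
        for line in authorized_keys_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }
    return [key for key in expected if key_identity(key) not in present]


rendered = [
    "ssh-rsa AAAAplaceholder1 qe@redhat.com",
    "ssh-rsa AAAAplaceholder2 mco_test@redhat.com",
    "ssh-rsa AAAAplaceholder3 mco_test2@redhat.com",
]
on_node = "ssh-rsa AAAAplaceholder1 qe@redhat.com\n"
# Prints the two keys that the MachineConfig added but the node lacks.
print(missing_keys(rendered, on_node))
```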
Description of problem:
When an MCP has the on-cluster-build functionality enabled and we first configure a valid imageBuilderType in the on-cluster-build configmap, then later update this configmap with an invalid imageBuilderType, the machine-config ClusterOperator is not degraded.
Version-Release number of selected component (if applicable):
    $ oc get clusterversion
    NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.14.0-0.nightly-2023-09-12-195514   True        False         3h56m   Cluster version is 4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Create a valid OCB configmap, and 2 valid secrets. Like this:

    apiVersion: v1
    data:
      baseImagePullSecretName: mco-global-pull-secret
      finalImagePullspec: quay.io/mcoqe/layering
      finalImagePushSecretName: mco-test-push-secret
      imageBuilderType: ""
    kind: ConfigMap
    metadata:
      creationTimestamp: "2023-09-13T15:10:37Z"
      name: on-cluster-build-config
      namespace: openshift-machine-config-operator
      resourceVersion: "131053"
      uid: 1e0c66de-7a9a-4787-ab98-ce987a846f66

3. Label the "worker" MCP in order to enable the OCB functionality in it:

    $ oc label mcp/worker machineconfiguration.openshift.io/layering-enabled=

4. Wait for the machine-os-builder pod to be created, and for the build to be finished. Just wait for the pods; do not wait for the MCPs to be updated. As soon as the build pod has finished the build, go to step 5.

5. Patch the on-cluster-build configmap to use an invalid imageBuilderType:

    oc patch cm/on-cluster-build-config -n openshift-machine-config-operator -p '{"data":{"imageBuilderType": "fake"}}'
Actual results:
The machine-os-builder pod crashes:

    $ oc get pods
    NAME                                         READY   STATUS             RESTARTS         AGE
    machine-config-controller-5bdd7b66c5-6l7sz   2/2     Running            2 (45m ago)      63m
    machine-config-daemon-5ttqh                  2/2     Running            0                63m
    machine-config-daemon-l95rj                  2/2     Running            0                63m
    machine-config-daemon-swtc6                  2/2     Running            2                57m
    machine-config-daemon-vq594                  2/2     Running            2                57m
    machine-config-daemon-zrf4f                  2/2     Running            0                63m
    machine-config-operator-7dd564556d-9smk4     2/2     Running            2 (45m ago)      65m
    machine-config-server-9sxjv                  1/1     Running            0                62m
    machine-config-server-m5sdl                  1/1     Running            0                62m
    machine-config-server-zb2hr                  1/1     Running            0                62m
    machine-os-builder-6cfbd8d5d-t6g8w           0/1     CrashLoopBackOff   6 (3m11s ago)    9m16s

But the machine-config ClusterOperator is not degraded:

    $ oc get co machine-config
    NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
    machine-config   4.14.0-0.nightly-2023-09-12-195514   True        False         False      63m
Expected results:
The machine-config ClusterOperator should become degraded when an invalid imageBuilderType is configured.
Additional info:
If we configure an invalid imageBuilderType directly (not by patching/editing the configmap), then the machine-config CO is degraded, but when we edit the configmap it is not.

A link to the must-gather file is provided in the first comment in this issue.

PS: If we wait for the MCPs to be updated in step 4, the machine-os-builder pod is not restarted with the new "fake" imageBuilderType, and the machine-config CO is not degraded either, although it should be.
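The expected result amounts to validating the configmap value on every change, not only at creation time, and reflecting the result in a Degraded condition. The sketch below assumes the only valid values are the two named in these reports plus the empty string (which defaults). The function name, the `InvalidImageBuilderType` reason and the message text are hypothetical, not the operator's actual output.

```python
# Builder types mentioned in these reports; the real list may differ.
KNOWN_BUILDER_TYPES = {"openshift-image-builder", "custom-pod-builder"}


def degraded_condition(configmap_data: dict) -> dict:
    """Compute a Degraded condition from the on-cluster-build configmap data."""
    value = configmap_data.get("imageBuilderType") or ""
    if value == "" or value in KNOWN_BUILDER_TYPES:
        return {"type": "Degraded", "status": "False"}
    return {
        "type": "Degraded",
        "status": "True",
        "reason": "InvalidImageBuilderType",  # hypothetical reason string
        "message": f'invalid imageBuilderType "{value}"',
    }


print(degraded_condition({"imageBuilderType": ""})["status"])      # False
print(degraded_condition({"imageBuilderType": "fake"})["status"])  # True
```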
Description of problem:
When opting into on-cluster builds on both the worker and control plane MachineConfigPools, the maxUnavailable value on the MachineConfigPools is not respected when the newly built image is rolled out to all of the nodes in a given pool.
Version-Release number of selected component (if applicable):
How reproducible:
Sometimes reproducible. I'm still working on figuring out what conditions need to be present for this to occur.
Steps to Reproduce:
1. Opt an OpenShift cluster in on-cluster builds by following these instructions: https://github.com/openshift/machine-config-operator/blob/master/docs/OnClusterBuildInstructions.md

2. Ensure that both the worker and control plane MachineConfigPools are opted in.
Actual results:
Multiple nodes in both the control plane and worker MachineConfigPools are drained and cordoned simultaneously, irrespective of the maxUnavailable value. This is particularly problematic for control plane nodes since draining more than one control plane node at a time can cause etcd issues, in addition to PDBs (Pod Disruption Budgets) which can make the config change take substantially longer or block completely. I've mostly seen this issue affect control plane nodes, but I've also seen it impact both control plane and worker nodes.
Expected results:
I would have expected the new OS image to be rolled out in a similar fashion as new MachineConfigs are rolled out. In other words, a single node (or nodes up to maxUnavailable for non-control-plane nodes) is cordoned, drained, updated, and uncordoned at a time.
Additional info:
I suspect the bug may be someplace within the NodeController since that's the part of the MCO that controls which nodes update at a given time. That said, I've had difficulty reliably reproducing this issue, so finding a root cause could be more involved. This also seems to be mostly confined to the initial opt-in process. Subsequent updates seem to follow the original "rules" more closely.
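The expected rollout rule described above can be sketched as a candidate-selection function: at most one control plane node at a time, and at most maxUnavailable nodes for other pools, counting nodes that are already unavailable. This is an illustration of the rule only, not the NodeController implementation; the function name and the node dictionary shape are assumptions.

```python
from typing import Dict, List


def nodes_to_update(nodes: List[Dict], max_unavailable: int, is_control_plane: bool) -> List[str]:
    """Pick node names that may start updating without exceeding the limit.

    Each node is a dict with 'name', 'updated' (bool) and 'unavailable' (bool).
    """
    # Control plane nodes are updated one at a time regardless of maxUnavailable.
    limit = 1 if is_control_plane else max_unavailable
    already_unavailable = sum(1 for node in nodes if node["unavailable"])
    capacity = max(0, limit - already_unavailable)
    candidates = [
        node["name"]
        for node in nodes
        if not node["updated"] and not node["unavailable"]
    ]
    return candidates[:capacity]


masters = [{"name": f"master-{i}", "updated": False, "unavailable": False} for i in range(3)]
print(nodes_to_update(masters, max_unavailable=3, is_control_plane=True))  # ['master-0']
```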
In its current state, the BuildController unit test suite sometimes fails unexpectedly. This causes loss of confidence in the MCO unit test suite and can block PRs from merging, even when the changes the PR introduces are unrelated to BuildController. I suspect there is a race condition within the test suite which, combined with the test suite itself being aggressively parallel, causes the test suite to fail unexpectedly.
Done When:
Description of problem:
MachineConfigs that use 3.4.0 ignition with kernelArguments are not currently allowed by the MCO. In on-cluster build pools, when we create a 3.4.0 MC with kernelArguments, the pool is not degraded. No new rendered MC is created either.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-06-065940
How reproducible:
Always
Steps to Reproduce:
1. Enable on-cluster build in the "worker" pool.

2. Create a MC using 3.4.0 ignition version with kernelArguments:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      creationTimestamp: "2023-09-07T12:52:11Z"
      generation: 1
      labels:
        machineconfiguration.openshift.io/role: worker
      name: mco-tc-66376-reject-ignition-kernel-arguments-worker
      resourceVersion: "175290"
      uid: 10b81a5f-04ee-4d7b-a995-89f319968110
    spec:
      config:
        ignition:
          version: 3.4.0
        kernelArguments:
          shouldExist:
            - enforcing=0
Actual results:
The build process is triggered and a new image is built and deployed. The pool is never degraded.
Expected results:
MCs with ignition 3.4.0 kernelArguments are not currently allowed. The MCP should be degraded, reporting a message similar to this one (this is the error reported if we deploy the MC in the master pool, which is a normal pool):

    oc get mcp -o yaml
    ....
    - lastTransitionTime: "2023-09-07T12:16:55Z"
      message: 'Node sregidor-s10-7pdvl-master-1.c.openshift-qe.internal is reporting: "can''t reconcile config rendered-master-57e85ed95604e3de944b0532c58c385e with rendered-master-24b982c8b08ab32edc2e84e3148412a3: ignition kargs section contains changes"'
      reason: 1 nodes are reporting degraded status on sync
      status: "True"
      type: NodeDegraded
Additional info:
When the image is deployed (it shouldn't be deployed), the kernel argument enforcing=0 is not present:

    sh-5.1# cat /proc/cmdline
    BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/vmlinuz-5.14.0-284.30.1.el9_2.x86_64 ostree=/ostree/boot.0/rhcos/05f51fadbc7fe74fa1e2ba3c0dbd0268c6996f0582c05dc064f137e93aa68184/0 ignition.platform.id=gcp console=tty0 console=ttyS0,115200n8 root=UUID=95083f10-c02f-4d94-a5c9-204481ce3a91 rw rootflags=prjquota boot=UUID=0440a909-3e61-4f7c-9f8e-37fe59150665 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all psi=1
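The check that on-cluster build pools are expected to share with normal pools can be sketched as a comparison of the Ignition `kernelArguments` section between the current and the desired rendered config. The function name is hypothetical and this is not the MCO's reconcile code; only the error text is taken from the message quoted in the expected results.

```python
def check_ignition_kargs(old_config: dict, new_config: dict) -> None:
    """Raise if the Ignition kernelArguments section differs between configs.

    Both arguments are the `spec.config` (Ignition) part of a MachineConfig,
    already parsed into dictionaries.
    """
    old_kargs = old_config.get("kernelArguments") or {}
    new_kargs = new_config.get("kernelArguments") or {}
    if old_kargs != new_kargs:
        raise ValueError("ignition kargs section contains changes")


current = {"ignition": {"version": "3.4.0"}}
desired = {
    "ignition": {"version": "3.4.0"},
    "kernelArguments": {"shouldExist": ["enforcing=0"]},
}

check_ignition_kargs(current, current)  # no change, no error
try:
    check_ignition_kargs(current, desired)
except ValueError as err:
    print(err)  # ignition kargs section contains changes
```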
Allow attaching an ISO image that will be used for data on an already provisioned system using a BMH.
Currently this can be achieved using the existing BMH.Spec.Image fields, but this attempts to change the boot order of the system and relies on the host to fall back to the installed system when booting the image fails.
Scope questions:
Neither logging nor node.last_error is handled.
While the basic implementation of the API is done, we still need to add the functionality to the redfish driver.
Add support for NAT Gateways in Azure while deploying OpenShift on this cloud to manage the outbound network traffic, and make this the default option for new deployments.
While deploying OpenShift on Azure, the Installer will configure NAT Gateways as the default method to handle outbound network traffic, so we can prevent the existing SNAT port exhaustion issues related to the outboundType configured by default.
The installer will use the NAT Gateway object from Azure to manage the outbound traffic from OpenShift.
The installer will create a NAT Gateway object per AZ in Azure so the solution is HA.
Using NAT Gateway for egress traffic is the recommended approach from Microsoft
This is also a common ask from enterprise customers: with the current solution used by OpenShift for outbound traffic management in Azure, they are hitting SNAT port exhaustion issues.
Which other projects and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
This work depends on the work done in CORS-2564
Create a GCP cloud specific spec.resourceTags entry in the infrastructure CRD. This should create and update tags (or labels in GCP) on any openshift cloud resource that we create and manage. The behaviour should also tag existing resources that do not have the tags yet and once the tags in the infrastructure CRD are changed all the resources should be updated accordingly.
Tag deletes continue to be out of scope, as the customer can still have custom tags applied to the resources that we do not want to delete.
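The add-and-update-but-never-delete behavior described above can be sketched as a merge of the desired labels over whatever the resource already carries. This only illustrates the rule; it is not the operator or installer code, and the function name and label values are hypothetical.

```python
from typing import Dict, Tuple


def merge_labels(current: Dict[str, str], desired: Dict[str, str]) -> Tuple[Dict[str, str], bool]:
    """Return the merged labels and whether the resource needs an update.

    Desired keys are added or overwritten; keys only present on the
    resource (for example custom customer tags) are left untouched.
    """
    merged = dict(current)
    merged.update(desired)
    return merged, merged != current


on_resource = {"team": "storage", "env": "dev"}   # 'team' is a custom customer label
from_infra_cr = {"env": "prod", "cost-center": "1234"}

merged, changed = merge_labels(on_resource, from_infra_cr)
print(merged)   # {'team': 'storage', 'env': 'prod', 'cost-center': '1234'}
print(changed)  # True
```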
Due to the ongoing intree/out of tree split on the cloud and CSI providers, this should not apply to clusters with intree providers (!= "external").
Once confident we have all components updated, we should introduce an end2end test that makes sure we never create resources that are untagged.
Goals
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
List any affected packages or components.
During the create cluster phase, the installer creates the GCP resources listed below, and the user-defined tags should be applied to these resources.
Resources List
Resource | Terraform API |
---|---|
VM Instance | google_compute_instance |
Storage Bucket | google_storage_bucket |
Acceptance Criteria:
The enhancement proposed for GCP tags support in OCP requires machine-api-provider-gcp to add the GCP userTags, available in the status sub-resource of the infrastructure CR, to the GCP virtual machine resources it creates.
Acceptance Criteria
ABI uses the assisted installer plus kube-api (or any other client that communicates with the service); all the building blocks related to day2 installation exist in those components.
Assisted installer can create installed cluster and use it to perform day2 operations
A doc that explains how it's done with kube-api
Parameters that are required from the user:
Actions required from the user
To keep a similar flow between day1 and day2, I suggest running the service on each node that the user is trying to add. It will create the cluster definition and start the installation; after the first reboot it will pull the ignition from the day1 cluster.
Deploy a command to generate a suitable ISO for adding a node to an existing cluster
The ignition asset currently assembles the ignition file with the files and services required to install a cluster. In the add-node case, this needs to be modified to support the new workflow.
Usually the manifest assets (ClusterImageSet / AgentPullSecret / InfraEnv / NMStateConfig / AgentClusterInstall / ClusterDeployment) depend on OptionalInstallConfig (or eventually a file in the asset dir, in the case of ZTP manifests). We'll need to change the asset code so that the required info can be retrieved from the ClusterInfo asset instead of OptionalInstallConfig. This may impact the asset framework itself.
Another approach could be to stick this info directly into OptionalInstallConfig, if possible
Create a new asset to manage the content of the nodes-config.yaml file
Not all the required info is provided by the user (in fact, we want to minimize the amount of configuration the user has to provide). Some of the required info needs to be extracted from the existing cluster, or from the existing kubeconfig. A dedicated asset could be useful for such an operation.
The two commands, one for adding the nodes (ISO generation) and the other to monitor the process, should be exposed by a new cli tool (name to be defined) built using the installer source. This task will add the main of the cli tool and the two (empty) command entry points.
Customers can trust the metadata in our operators catalogs to reason about infrastructure compatibility and interoperability. Similar to OCPPLAN-7983 the requirement is that this data is present for every layered product and Red Hat-release operator and ideally also ISV operators.
Today it is hard to validate the presence of this data due to the metadata format. This feature tracks introducing a new format, implementing the appropriate validation and enforcement of presence, as well as defining a grace period in which both formats are acceptable.
Customers can rely on the operator metadata as the single source of truth for capability and interoperability information instead of having to look up product-specific documentation. They can use this data to filter in on-cluster and public catalog displays as well as in their pipelines or custom workflows.
Red Hat Operators are required to provide this data and we aim for near 100% coverage in our catalogs.
Absence of this data can reliably be detected and will subsequently lead to gating in the release process.
In the MCO today, we always reboot during the initial node bootstrap pivot. This is because the machine-config CRD also manages non-ignition fields like OSImageURL, kernelArguments, extensions, etc. Any update in these fields would require a node reboot.
Most of the time the cluster OSImageURL is different from the boot image and hence it results in a node reboot.
However, there are certain use cases (like bare metal) where we would like to skip this reboot to bring up the node faster. The cluster admin would boot the node with a boot image matching the cluster OSImageURL, but it will still lead to a reboot.
For this effort, two areas have been identified where the MCO can possibly be improved to skip the reboot during the initial pivot:
Related RFE - https://issues.redhat.com/browse/RFE-3621
Note: With additional findings from Assisted installer team, scope of work has been re-framed to meet the requirement of assisted installer workflow.
After https://github.com/openshift/machine-config-operator/pull/3814 merges, it will be possible to use the kernelArgs functionality that has been introduced in ignition. We can use this to sync kernelArgs supplied through MachineConfig to the ignition field. As a result, MachineConfig-supplied kargs can be available when the node boots up and we don't require a reboot.
Acceptance Criteria:
Related story- https://issues.redhat.com/browse/MCO-217
Related RFE- https://issues.redhat.com/browse/RFE-3621
The systemd service ovs-configuration.service is skipped if the file /etc/ignition-machine-config-encapsulated.json exists. The reason is that there is an assumption that reboot will be done if the file exists.
When we want to skip reboot, we need to verify that the service is not skipped. Therefore, the service will retry to configure until the file does not exist.
After PR https://github.com/openshift/os/pull/657 lands, RHCOS nodes booting from a bootimage will have a digested pull spec, which will be available in the field `container-image-reference-digest` via rpm-ostree status.
With this, we can teach the MCD to look at container-image-reference-digest for comparison when OSImageURL is not available (this is the case when a node boots from a bootimage). When it matches, we can say that both the bootimage and the OCP cluster have the same OS content and we can safely skip the node reboot during the initial pivot.
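The comparison described above can be sketched as follows. This is an assumption-laden illustration, not the MCD code: the function names are hypothetical, the booted value is accepted either as a bare `sha256:` digest or as a digested pull spec (the exact shape of the rpm-ostree status field is not stated here), and the digests are made-up placeholders.

```python
from typing import Optional


def digest_of(reference: str) -> Optional[str]:
    """Extract the sha256 digest from a bare digest or a digested pull spec."""
    candidate = reference.rpartition("@")[2]  # whole string when there is no '@'
    return candidate if candidate.startswith("sha256:") else None


def can_skip_reboot(booted_reference: str, cluster_os_image_url: str) -> bool:
    """True only when both sides carry a digest and the digests match."""
    booted = digest_of(booted_reference)
    target = digest_of(cluster_os_image_url)
    return booted is not None and booted == target


digest_a = "sha256:" + "a" * 64  # placeholder digests
digest_b = "sha256:" + "b" * 64
print(can_skip_reboot(digest_a, "quay.io/example/os@" + digest_a))  # True
print(can_skip_reboot(digest_a, "quay.io/example/os@" + digest_b))  # False
print(can_skip_reboot(digest_a, "quay.io/example/os:latest"))       # False
```

Requiring a digest on both sides means a tag-only reference never matches, so the sketch falls back to rebooting when it cannot prove the contents are identical.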
Note: Scope of this work has been reduced to PR https://github.com/openshift/machine-config-operator/pull/3857 as this is sufficient for Assisted installer use case today.
An SRE/Cluster Admin will be able to use the multi payload in the same way as a single arch payload for a single arch cluster when installing with the agent installer. This is most useful when using the agent installer in disconnected environments.
The agent installer will work for installs involving the multi-arch payload
You will not be able to install multi-architecture clusters - nodes of a different architecture will need to be added as a day 2 operation
As a SRE/Cluster Admin I expect that the multi payload can be used the same way as a single arch payload for a single arch cluster when installing with agent installer
AC:
non-goal:
The current agent code uses `oc` to extract files from the release payload. Need to add `--filter-by-os` to the appropriate `oc` templates to ensure a user can create an agent.<arch>.iso with the multi payload on any supported client arch.
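As a rough sketch of the idea, the argument list for an `oc image extract` call could pin the target platform with `--filter-by-os`, so that the result does not depend on the architecture of the machine running the client. The helper name, the architecture table, and the image and paths below are placeholders for illustration; the actual `oc` templates in the agent code may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// goArch maps RPM-style architecture names (as used in agent.<arch>.iso)
// to the Go-style names used in "os/arch" platform strings.
// The set of entries is an assumption for this example.
var goArch = map[string]string{
	"x86_64":  "amd64",
	"aarch64": "arm64",
	"ppc64le": "ppc64le",
	"s390x":   "s390x",
}

// extractArgs (hypothetical helper) builds the arguments for `oc image extract`
// with the platform pinned through --filter-by-os.
func extractArgs(arch, image, srcPath, destDir string) ([]string, error) {
	a, ok := goArch[arch]
	if !ok {
		return nil, fmt.Errorf("unsupported architecture %q", arch)
	}
	return []string{
		"image", "extract",
		"--filter-by-os=linux/" + a,
		"--path", srcPath + ":" + destDir,
		"--confirm",
		image,
	}, nil
}

func main() {
	args, err := extractArgs("aarch64", "registry.example.com/release@sha256:abcd", "/coreos/", "/tmp/out")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("oc " + strings.Join(args, " "))
}
```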
Provide a unified view of all options to create Serverless Functions so developers can get quickly started with their preferred option.
A single UI panel in ODC that shows in-product and off-product choices to create Serverless functions, and provides an in-product creation experience for applicable choices.
An inline editor is out of scope.
Background: We need to provide the dev console experience for Serverless Function Experience.
Miro Board for inspiration and vision:
https://miro.com/app/board/uXjVPNtkJCI=/
Acceptance Criteria is in the parent feature.
When a user navigates to the Functions page and clicks on a function, the details view should show Revisions, Routes, and Pods tabs for the resources associated with that function, alongside the existing Details and YAML tabs.
Description of problem:
Styling issue in functions list page after PatternFly upgrade
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Always
Steps to Reproduce:
1. Install serverless operator
2. Create knative-serving instance
3. Go to Functions menu in Dev perspective
Actual results:
https://github.com/openshift/console/pull/13348#issuecomment-1822469023
Expected results:
Styling should be proper
Additional info:
Update quick start name and document links for getting started section based on information provided in document https://docs.google.com/document/d/1xy9GwGR5m4p9W_RJ8Wt_164LpCP57x-lNJlObGfBJ9M/edit#heading=h.beapmz2o0lv7
AC: Add serverless function icon in Add page group header(Attached image for reference)
Slack thread - https://redhat-internal.slack.com/archives/C05MDC1T35J/p1700149065933539
Create getting started content in functions list page and add CLI, IDE extensions, samples links to it (Design is yet to be provided, so start with exploration and with dummy data)
In the developer perspective, in the left-side navigation menu, add a Functions tab inside the Resources section. It lists all the serverless functions created in the selected namespace; clicking a function opens the Service details tab.
We need to ensure we have parity with OCP and support heterogeneous clusters
https://github.com/openshift/enhancements/pull/1014
Using a multi-arch Node requires the HC to be multi-arch as well. This is a good recipe for letting users shoot themselves in the foot. We need to automate the input required via the CLI for a multi-arch NodePool to work, e.g. on HC creation, enabling a multi-arch flag which sets the right release image.
As a user of HyperShift, I would like the validation around the `arch` flag improved so that it results in a smoother user experience. The problem today is that we default Arch to `amd64`, but then set an invalid status message on the NodePool CRD if it's not blank and the platform is not AWS.
DEFINE once path forward decided; SEE Engineering Details for more details.
At a minimum we should remove the empty `Arch` flag check here.
What about modifying the section to something like this:
// Validate modifying CPU arch support for platform
if (nodePool.Spec.Arch != "amd64") && (nodePool.Spec.Platform.Type != hyperv1.AWSPlatform) {
	SetStatusCondition(&nodePool.Status.Conditions, hyperv1.NodePoolCondition{
		Type:               hyperv1.NodePoolValidArchPlatform,
		Status:             corev1.ConditionFalse,
		Reason:             hyperv1.NodePoolInvalidArchPlatform,
		Message:            fmt.Sprintf("modifying CPU arch from 'amd64' not supported for platform: %s", nodePool.Spec.Platform.Type),
		ObservedGeneration: nodePool.Generation,
	})
}
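To see which combinations the proposed condition would flag, here is a self-contained sketch that uses stand-in types instead of the real `hyperv1` API. The type, constant, and function names are assumptions made only so the example runs on its own; only the condition and the message text come from the snippet above.

```go
package main

import "fmt"

// PlatformType is a stand-in for the platform type in the HyperShift API.
type PlatformType string

const (
	AWSPlatform   PlatformType = "AWS"
	AzurePlatform PlatformType = "Azure"
)

// archPlatformMessage mirrors the proposed condition: it returns the
// validation message when the arch/platform pair would be flagged, or ""
// when the pair would be accepted.
func archPlatformMessage(arch string, platform PlatformType) string {
	if arch != "amd64" && platform != AWSPlatform {
		return fmt.Sprintf("modifying CPU arch from 'amd64' not supported for platform: %s", platform)
	}
	return ""
}

func main() {
	cases := []struct {
		arch     string
		platform PlatformType
	}{
		{"amd64", AzurePlatform},
		{"arm64", AWSPlatform},
		{"arm64", AzurePlatform},
		{"", AzurePlatform},
	}
	for _, c := range cases {
		fmt.Printf("arch=%q platform=%s -> %q\n", c.arch, c.platform, archPlatformMessage(c.arch, c.platform))
	}
}
```

Note that with this condition an empty arch on a non-AWS platform is still flagged, so the check relies on the `amd64` default having been applied before validation runs.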
When the cluster does not have v1 builds, console needs to either provide different ways to build applications or prevent erroneous actions.
Identify the build system in place and prompt user accordingly when building applications.
The console will have to hide any workflows that rely solely on BuildConfigs when BuildConfigs are unavailable and Pipelines is not installed.
ODC Jira - https://issues.redhat.com/browse/ODC-7352
Without this enhancement, users will encounter issues when trying to create applications on clusters that do not have the default s2i setup.
If we detect Shipwright, then we can call that API instead of buildconfigs. We need to understand the timelines for the latter part, and create a separate work item for it.
If both buildconfigs and Shipwright are available, then we should default to Shipwright. This will be part of the separate work item needed to support Shipwright.
Rob Gormley to confirm the timeline for when customers will have the option to remove buildconfigs from their clusters. That will determine whether we take on this work in 4.15 or 4.16.
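The preference order described above (Shipwright first when it is detected, then BuildConfigs, then Pipelines, otherwise hide the build workflows) could be captured roughly as follows. This is only an illustrative sketch: the function name, the option names, and the three boolean inputs are hypothetical and do not reflect how the console detects capabilities.

```go
package main

import "fmt"

// buildOption names the workflow offered to the user (hypothetical values).
type buildOption string

const (
	optionShipwright  buildOption = "Shipwright"
	optionBuildConfig buildOption = "BuildConfig"
	optionPipelines   buildOption = "Pipelines"
	optionHidden      buildOption = "hide build workflows"
)

// pickBuildOption applies the preference order from the text: Shipwright wins
// even when BuildConfigs are also available, and build workflows are hidden
// when nothing usable is installed.
func pickBuildOption(hasBuildConfig, hasShipwright, hasPipelines bool) buildOption {
	switch {
	case hasShipwright:
		return optionShipwright
	case hasBuildConfig:
		return optionBuildConfig
	case hasPipelines:
		return optionPipelines
	default:
		return optionHidden
	}
}

func main() {
	fmt.Println(pickBuildOption(true, true, false))
	fmt.Println(pickBuildOption(false, false, true))
	fmt.Println(pickBuildOption(false, false, false))
}
```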
As a user, I would like to use the Import from Git form even if I don't have BC installed in my cluster, but I have installed the Pipelines operator.
As a user, I want to use the Import from Git form without any errors, to create the Pipeline for my Git Application if I have disabled Builds and installed Pipelines in the cluster.
(During the implementation, we are also trying to keep in mind the changes that have to be made later while adding SW into this form)
This is an initial prototype. This needs to be presented to the PMs for their feedback and updated accordingly.
The final UI must have the acknowledgement from the PMs and after that has to be merged.
Description of problem:
Hide the Builds NavItem if BuildConfig is not installed in the cluster
As a user, I don't want to see the "DeploymentConfigs" option in any form I am filling in when DeploymentConfigs are not installed in the cluster.
Change API designator from alpha to beta for v1 Shipwright builds.
To maintain currency with API specs
ODC Jira - https://issues.redhat.com/browse/ODC-7353
In order to avoid an increased support overhead once the license changes at the end of the year, we should replace the instances in which metal IPI uses Terraform.
When we used Terraform to provision the control plane, the Terraform deployment could eventually time out and report an error. The installer was monitoring the Terraform output and could pass the error on to the user, e.g.
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h9m1s elapsed]
level=error
level=error msg=Error: could not inspect: inspect failed , last error was 'timeout reached while inspecting the node'
level=error
level=error msg= with ironic_node_v1.openshift-master-host[2],
level=error msg= on main.tf line 13, in resource "ironic_node_v1" "openshift-master-host":
level=error msg= 13: resource "ironic_node_v1" "openshift-master-host" {
Now that provisioning is managed by Metal³, we have nothing monitoring it for errors:
level=info msg=Waiting up to 1h0m0s (until 1:05AM UTC) for bootstrapping to complete...
level=debug msg=Bootstrap status: complete
By this stage the bootstrap API is up (and this is a requirement for BMO to do its thing). The installer is capable of monitoring the API for the appearance of the bootstrap complete ConfigMap, so it is equally capable of monitoring the BaremetalHost status. This should actually be an improvement on Terraform, as we can monitor in real time as the hosts progress through the various stages, and report on errors and retries.
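A minimal sketch of what such monitoring could report is shown below. It deliberately uses a trimmed-down, hypothetical `hostStatus` struct instead of the real BareMetalHost types, and the state strings and error text are example values; the point is only the shape of the check: collect per-host errors as they appear and declare success once every host is provisioned.

```go
package main

import "fmt"

// hostStatus is a hypothetical, trimmed-down view of a BareMetalHost status.
type hostStatus struct {
	Name         string
	State        string // example values: "inspecting", "provisioning", "provisioned"
	ErrorMessage string
}

// summarize reports whether every host is provisioned and returns one
// human-readable line for each host that currently reports an error.
func summarize(hosts []hostStatus) (allProvisioned bool, problems []string) {
	allProvisioned = len(hosts) > 0
	for _, h := range hosts {
		if h.ErrorMessage != "" {
			problems = append(problems, fmt.Sprintf("%s (%s): %s", h.Name, h.State, h.ErrorMessage))
		}
		if h.State != "provisioned" {
			allProvisioned = false
		}
	}
	return allProvisioned, problems
}

func main() {
	hosts := []hostStatus{
		{Name: "openshift-master-0", State: "provisioned"},
		{Name: "openshift-master-1", State: "provisioning"},
		{Name: "openshift-master-2", State: "inspecting", ErrorMessage: "timeout reached while inspecting the node"},
	}
	done, problems := summarize(hosts)
	fmt.Println("all provisioned:", done)
	for _, p := range problems {
		fmt.Println("problem:", p)
	}
}
```

An installer loop could call a function like this on each poll of the API and surface the problem lines to the user in real time.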
We will re-implement the functionality of terraform-provider-libvirt using Go libraries.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision vSphere infrastructure without the use of Terraform.
okd required guestinfo domain
and
stealclock accounting
Causes password leaking in CI.
WARNING FeatureSet "CustomNoUpgrade" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster.
INFO Creating infrastructure resources...
DEBUG Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'
DEBUG The file was found in cache: /home/jcallen/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing...
<very long pause here with no indication>
Description of problem:
The installer downloads the RHCOS image to the local cache multiple times when failure domains are used.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
time="2024-02-22T14:34:42-05:00" level=debug msg="Generating Cluster..."
time="2024-02-22T14:34:42-05:00" level=warning msg="FeatureSet \"CustomNoUpgrade\" is enabled. This FeatureSet does not allow upgrades and may affect the supportability of the cluster."
time="2024-02-22T14:34:42-05:00" level=info msg="Creating infrastructure resources..."
time="2024-02-22T14:34:43-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:34:43-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:36:02-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:36:02-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:37:22-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:37:22-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:38:39-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:38:39-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=debug msg="Obtaining RHCOS image file from 'https://rhcos.mirror.openshift.com/art/storage/prod/streams/4.16-9.4/builds/416.94.202402130130-0/x86_64/rhcos-416.94.202402130130-0-vmware.x86_64.ova?sha256=fb823b5c31afbb0b9babd46550cb9fe67109a46f48c04ce04775edbb6a309dd3'"
time="2024-02-22T14:39:33-05:00" level=debug msg="The file was found in cache: /home/ngirard/.cache/openshift-installer/image_cache/rhcos-416.94.202402130130-0-vmware.x86_64.ova. Reusing..."
time="2024-02-22T14:39:33-05:00" level=error msg="failed to fetch Cluster: failed to generate asset \"Cluster\": failed to create cluster: failed during pre-provisioning: unable to initialize folders and templates: failed to import ova: failed to lease wait: The name 'ngirard-dev-pr89z-rhcos-us-west-us-west-1a' already exists."
Expected results:
The image should be downloaded only once.
Additional info:
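One generic way to get the expected "once only" behavior is to memoize the lookup by image URL, so that each additional failure domain reuses the first result. The sketch below illustrates that technique only; the type and function names are hypothetical, the fetch function is a stub, and it is not the installer's actual fix.

```go
package main

import (
	"fmt"
	"sync"
)

// imageCache remembers the local path obtained for each image URL.
type imageCache struct {
	mu    sync.Mutex
	paths map[string]string
	fetch func(url string) (string, error) // performs the expensive download
}

func newImageCache(fetch func(string) (string, error)) *imageCache {
	return &imageCache{paths: map[string]string{}, fetch: fetch}
}

// Get returns the cached path for url, calling fetch only on the first request.
// Failed fetches are not cached, so a later call can retry.
func (c *imageCache) Get(url string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.paths[url]; ok {
		return p, nil
	}
	p, err := c.fetch(url)
	if err != nil {
		return "", err
	}
	c.paths[url] = p
	return p, nil
}

func main() {
	fetches := 0
	cache := newImageCache(func(url string) (string, error) {
		fetches++ // stands in for the real download
		return "/tmp/image_cache/rhcos.ova", nil
	})
	for _, fd := range []string{"us-west-1a", "us-west-1b", "us-east-1a"} {
		path, err := cache.Get("https://example.com/rhcos.ova")
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("failure domain %s uses %s\n", fd, path)
	}
	fmt.Println("fetches:", fetches)
}
```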
Replace https://github.com/openshift/installer/tree/master/upi/vsphere with powercli. Keep terraform in place until powercli installations are working.
example of updates to be made to the upi image:
~~~
FROM upi-installer-image
RUN curl https://packages.microsoft.com/config/rhel/8/prod.repo | tee /etc/yum.repos.d/microsoft.repo
RUN yum install -y powershell
RUN pwsh -Command 'Install-Module VMware.PowerCLI -Force -Scope CurrentUser'
~~~
USER STORY:
The assisted installer should be able to use CAPI-based vsphere installs without requiring access to vcenter.
DESCRIPTION:
The installer makes calls to vcenter to determine the networks, which are required for CAPI based installs, but vcenter access is not guaranteed in the assisted installer.
See:
which were lovingly lifted from this slack thread.
Required:
In cases where the installer calls vcenter to obtain values to populate manifests, the installer should leave empty fields (or a default value) if it is unable to access vcenter. It should produce partial manifests, rather than throw an error.
Nice to have:
...
ACCEPTANCE CRITERIA:
Continued compatibility with agent installer, particularly producing capi manifests when access to vcenter fails.
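The "partial manifests rather than an error" requirement can be illustrated with a small sketch. Everything here is hypothetical (the lookup signature, the function names, the network name); it only shows the pattern of leaving a field empty and recording what could not be resolved when vCenter cannot be reached.

```go
package main

import (
	"errors"
	"fmt"
)

// lookupFunc resolves a network name to its vCenter path (hypothetical signature).
type lookupFunc func(name string) (string, error)

var errUnreachable = errors.New("vcenter unreachable")

// offlineLookup simulates an environment without vCenter access.
func offlineLookup(string) (string, error) { return "", errUnreachable }

// resolveNetworks returns one entry per requested network. When the lookup
// fails, the entry is left empty and the name is recorded in unresolved,
// so the caller can still emit a partial manifest.
func resolveNetworks(names []string, lookup lookupFunc) (paths []string, unresolved []string) {
	for _, n := range names {
		p, err := lookup(n)
		if err != nil {
			paths = append(paths, "")
			unresolved = append(unresolved, n)
			continue
		}
		paths = append(paths, p)
	}
	return paths, unresolved
}

func main() {
	paths, unresolved := resolveNetworks([]string{"VM Network"}, offlineLookup)
	fmt.Printf("paths=%q unresolved=%q\n", paths, unresolved)
}
```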
ENGINEERING DETAILS:
level=info msg=Running process: vsphere infrastructure provider with args [-v=2 --metrics-bind-addr=0 --health-addr=127.0.0.1:37167 --webhook-port=38521 --webhook-cert-dir=/tmp/envtest-serving-certs-445481834 --leader-elect=false] and env [...]
This log line may contain sensitive data such as passwords and logins, so it should be filtered.
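A simple way to filter such output is to redact the values of environment entries whose names look sensitive before they are logged. The sketch below is an assumption-laden illustration: the marker list is a heuristic, and the variable names in `main` are made up rather than taken from the installer.

```go
package main

import (
	"fmt"
	"strings"
)

// sensitiveMarkers is a heuristic list of substrings that mark a variable
// name as sensitive. The exact list is an assumption for this example.
var sensitiveMarkers = []string{"PASSWORD", "SECRET", "TOKEN", "USERNAME"}

// isSensitive reports whether a variable name contains one of the markers.
func isSensitive(key string) bool {
	upper := strings.ToUpper(key)
	for _, m := range sensitiveMarkers {
		if strings.Contains(upper, m) {
			return true
		}
	}
	return false
}

// redactEnv returns a copy of env ("KEY=value" entries) in which the values
// of sensitive-looking keys are replaced, so the result is safe to log.
func redactEnv(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, found := strings.Cut(kv, "=")
		if found && isSensitive(key) {
			out = append(out, key+"=<redacted>")
			continue
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{"HOME=/root", "EXAMPLE_VSPHERE_PASSWORD=hunter2"}
	fmt.Println(redactEnv(env))
}
```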
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Azure infrastructure without the use of Terraform.
Launch a cluster provisioned with CAPI using a minimal (mostly default) install config.
In the ignition hook, we need to upload the bootstrap ignition data to the bootstrap storage account (created to hold the rhcos image). Then create an ignition stub containing the SAS link for the object.
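For illustration, an Ignition stub that points at a remote config uses the `ignition.config.replace.source` field of the Ignition v3 spec. The sketch below builds such a stub; the spec version chosen, the helper names, and the placeholder URL are assumptions for the example, and a real SAS link would be generated by the storage account rather than written by hand.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ignitionStub models only the fields needed to point at a remote config.
type ignitionStub struct {
	Ignition struct {
		Version string `json:"version"`
		Config  struct {
			Replace struct {
				Source string `json:"source"`
			} `json:"replace"`
		} `json:"config"`
	} `json:"ignition"`
}

// newStub (hypothetical helper) returns a stub whose whole config is
// replaced by the document found at sourceURL.
func newStub(sourceURL string) ([]byte, error) {
	var s ignitionStub
	s.Ignition.Version = "3.2.0" // assumed spec version for this example
	s.Ignition.Config.Replace.Source = sourceURL
	return json.Marshal(s)
}

// stubSource parses a stub and returns the URL it points at.
func stubSource(data []byte) (string, error) {
	var s ignitionStub
	if err := json.Unmarshal(data, &s); err != nil {
		return "", err
	}
	return s.Ignition.Config.Replace.Source, nil
}

func main() {
	// Placeholder URL; not a real storage account or signature.
	out, err := newStub("https://example.blob.core.windows.net/ignition/bootstrap.ign?sig=PLACEHOLDER")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Because the stub only carries a link, it stays small regardless of the size of the bootstrap ignition data held in the storage account.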
Create private zone and DNS records in the resource group specified by baseDomainResourceGroupName. The records should be cleaned up with destroy cluster.
The ControlPlaneEndpoint will be available in the Cluster spec and can be used to populate the DNS records.
Currently we create both A and CNAME records in different scenarios: https://github.com/openshift/installer/blob/master/data/data/azure/cluster/dns/dns.tf
Ideally we do this in the InfraReady hook, before machine creation, so that control plane machines can pull ignition immediately.
The install config allows users to specify a `diskEncryptionSet` in machinepools.
CAPZ has existing support for disk encryption sets:
Note that CAPZ says the encryption set must belong to the same subscription, whereas our docs may not indicate that. We should point this out to the docs team.
The `image` field in the AzureMachineSpec needs to point to an RHCOS image. For marketplace images, those images should already be available.
For non-marketplace images, we need to create an image for the users, using the VHD from the RHCOS stream.
The image could be created in the PreProvision hook: https://github.com/openshift/installer/blob/master/pkg/infrastructure/clusterapi/types.go#L26
Technically it could also be done in the InfraAvailable hook, if that is needed.
As a result of Hashicorp's license change to BSL, Red Hat OpenShift needs to remove the use of Hashicorp's Terraform from the installer – specifically for IPI deployments which currently use Terraform for setting up the infrastructure.
To avoid an increased support overhead once the license changes at the end of the year, we want to provision Nutanix infrastructure without the use of Terraform.
A guest cluster can use an external OIDC token issuer. This will allow machine-to-machine authentication workflows
A guest cluster can configure OIDC providers to support the current capability: https://kubernetes.io/docs/reference/access-authn-authz/authentication/#openid-connect-tokens and the future capability: https://github.com/kubernetes/kubernetes/blob/2b5d2cf910fd376a42ba9de5e4b52a53b58f9397/staging/src/k8s.io/apiserver/pkg/apis/apiserver/types.go#L164 with an API that
Description of problem:
In 4.15 cluster, after configuring external OIDC, kube-apiserver pod crashes with: run.go:74] "command failed" err="strict decoding error: unknown field \"jwt[0].claimMappings.uid\"". In 4.16 cluster, this issue is not reproduced.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-28-013638
How reproducible:
Unsure for now.
Steps to Reproduce:
1. Install 4.15 fresh HCP env and configure external OIDC:
$ HC_NAME=hypershift-ci-267403
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs415-267403-4.15/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs415-267403-4.15/hypershift-ci-267403.kubeconfig
AUDIENCE=76863fb1-xxxxxx
ISSUER_URL=https://login.microsoftonline.com/xxxxxxxx/v2.0
CLIENT_ID=76863fb1-xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret
$ curl -sS "$ISSUER_URL/.well-known/openid-configuration" > microsoft-entra-id-oauthMetadata
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ oc create configmap tested-oauth-meta --from-file=oauthMetadata=microsoft-entra-id-oauthMetadata -n clusters --kubeconfig $MGMT_KUBECONFIG
configmap/tested-oauth-meta created
$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: tested-oauth-meta
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"
hostedcluster.hypershift.openshift.io/hypershift-ci-267403 patched
$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG
secret/console-secret created
$ oc get authentication.config cluster -o yaml
apiVersion: config.openshift.io/v1
kind: Authentication
metadata:
...
spec:
  oauthMetadata:
    name: tested-oauth-meta
  oidcProviders:
  - claimMappings:
      groups:
        claim: groups
        prefix: 'oidc-groups-test:'
      username:
        claim: email
        prefix:
          prefixString: 'oidc-user-test:'
        prefixPolicy: Prefix
    issuer:
      audiences:
      - 76863fb1-xxxxxx
      issuerCertificateAuthority:
        name: ""
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
    name: microsoft-entra-id
    oidcClients:
    - clientID: 76863fb1-xxxxxx
      clientSecret:
        name: console-secret
      componentName: console
      componentNamespace: openshift-console
  serviceAccountIssuer: https://xxxxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267403
  type: OIDC
status:
  oidcClients:
  - componentName: console
    componentNamespace: openshift-console
    conditions:
    - lastTransitionTime: "2024-02-28T09:20:08Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Degraded
    - lastTransitionTime: "2024-02-28T09:20:08Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "False"
      type: Progressing
    - lastTransitionTime: "2024-02-28T09:20:08Z"
      message: ""
      reason: OIDCConfigAvailable
      status: "True"
      type: Available
    currentOIDCClients:
    - clientID: 76863fb1-xxxxxx
      issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
      oidcProviderName: microsoft-entra-id
$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
NAME READY STATUS RESTARTS AGE
...
openshift-apiserver-d665bdc58-7cfdg 3/3 Running 0 154m
kube-controller-manager-577cf4566f-sgxz2 1/1 Running 0 154m
openshift-apiserver-d665bdc58-52w9m 3/3 Running 0 154m
kube-apiserver-74f569dfb5-7tnmn 4/5 CrashLoopBackOff 7 (2m47s ago) 15m
$ oc logs --timestamps -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -c kube-apiserver kube-apiserver-74f569dfb5-7tnmn > ~/my/logs/kube-apiserver-74f569dfb5-7tnmn-CrashLoopBackOff-hcp415.log
$ oc get cm auth-config -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -o jsonpath='{.data.auth\.json}'
{"kind":"AuthenticationConfiguration","apiVersion":"apiserver.config.k8s.io/v1alpha1","jwt":[{"issuer":{"url":"https://login.microsoftonline.com/xxxxxxxx/v2.0","audiences":["76863fb1-xxxxxx"],"audienceMatchPolicy":"MatchAny"},"claimMappings":{"username":{"claim":"email","prefix":"oidc-user-test:"},"groups":{"claim":"groups","prefix":"oidc-groups-test:"},"uid":{}}}]}
$ oc get cm auth-config -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG -o jsonpath='{.data.auth\.json}' | jq | ~/auto/json2yaml.sh
---
kind: AuthenticationConfiguration
apiVersion: apiserver.config.k8s.io/v1alpha1
jwt:
- issuer:
    url: https://login.microsoftonline.com/xxxxxxxx/v2.0
    audiences:
    - 76863fb1-xxxxxx
    audienceMatchPolicy: MatchAny
  claimMappings:
    username:
      claim: email
      prefix: 'oidc-user-test:'
    groups:
      claim: groups
      prefix: 'oidc-groups-test:'
    uid: {}
$ vi ~/my/logs/kube-apiserver-74f569dfb5-7tnmn-CrashLoopBackOff-hcp415.log
...
2024-02-28T09:32:06.077307893Z I0228 09:32:06.077298 1 options.go:220] external host was not specified, using 172.20.0.1
2024-02-28T09:32:06.077977888Z I0228 09:32:06.077952 1 server.go:189] Version: v1.28.6+6216ea1
2024-02-28T09:32:06.077977888Z I0228 09:32:06.077971 1 server.go:191] "Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
2024-02-28T09:32:06.078556862Z I0228 09:32:06.078543 1 dynamic_serving_content.go:113] "Loaded a new cert/key pair" name="serving-cert::/etc/kubernetes/certs/server/tls.crt::/etc/kubernetes/certs/server/tls.key"
2024-02-28T09:32:06.408308750Z I0228 09:32:06.408274 1 dynamic_cafile_content.go:119] "Loaded a new CA Bundle and Verifier" name="client-ca-bundle::/etc/kubernetes/certs/client-ca/ca.crt"
2024-02-28T09:32:06.408487434Z E0228 09:32:06.408467 1 run.go:74] "command failed" err="strict decoding error: unknown field \"jwt[0].claimMappings.uid\""
Actual results:
As shown in above "Description of problem"
Expected results:
4.16 does not have the issue. 4.15 should have no such problem.
Additional info:
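The failure mode in the log (a strict decoder rejecting a field that its schema does not know) can be reproduced in miniature with the Go standard library. This sketch is not the kube-apiserver code path: it uses `encoding/json` with `DisallowUnknownFields` as a stand-in for the strict decoder, and the struct definitions are a deliberately trimmed, hypothetical schema without a `uid` mapping.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// prefixedClaim and the types below model a schema that has no "uid"
// claim mapping (hypothetical, trimmed for the example).
type prefixedClaim struct {
	Claim  string `json:"claim"`
	Prefix string `json:"prefix"`
}

type claimMappings struct {
	Username prefixedClaim `json:"username"`
	Groups   prefixedClaim `json:"groups"`
}

type jwtAuthenticator struct {
	ClaimMappings claimMappings `json:"claimMappings"`
}

type authConfig struct {
	JWT []jwtAuthenticator `json:"jwt"`
}

// strictDecode fails if data contains any field the schema does not define.
func strictDecode(data string) error {
	dec := json.NewDecoder(strings.NewReader(data))
	dec.DisallowUnknownFields()
	var cfg authConfig
	return dec.Decode(&cfg)
}

func main() {
	withUID := `{"jwt":[{"claimMappings":{"username":{"claim":"email","prefix":"oidc-user-test:"},"groups":{"claim":"groups","prefix":"oidc-groups-test:"},"uid":{}}}]}`
	withoutUID := `{"jwt":[{"claimMappings":{"username":{"claim":"email","prefix":"oidc-user-test:"},"groups":{"claim":"groups","prefix":"oidc-groups-test:"}}}]}`
	fmt.Println("with uid:   ", strictDecode(withUID))
	fmt.Println("without uid:", strictDecode(withoutUID))
}
```

Even an empty `"uid": {}` object is enough to make strict decoding fail when the reader's schema predates the field, which matches the shape of the generated `auth.json` shown in the steps above.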
Description of problem:
Updating oidcProviders does not take effect. See details below.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-26-155043
How reproducible:
Always
Steps to Reproduce:
1. Install fresh HCP env and configure external OIDC as steps 1 ~ 4 of https://issues.redhat.com/browse/OCPBUGS-29154 (to avoid repeated typing those steps, only referencing as is here).
2. Pods renewed:
    $ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
    ...
    network-node-identity-68b7b8dd48-4pvvq 3/3 Running 0 170m
    oauth-openshift-57cbd9c797-6hgzx 2/2 Running 0 170m
    kube-controller-manager-66f68c8bd8-tknvc 1/1 Running 0 164m
    kube-controller-manager-66f68c8bd8-wb2x9 1/1 Running 0 164m
    kube-controller-manager-66f68c8bd8-kwxxj 1/1 Running 0 163m
    kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 29m
    kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 27m
    kube-apiserver-596dcb97f-2rskz 5/5 Running 0 25m
    openshift-apiserver-c9455455c-t7prz 3/3 Running 0 22m
    openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 22m
    openshift-apiserver-c9455455c-npvn5 3/3 Running 0 21m
    konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 20m
    cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 20m
    hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 20m
    konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 20m
    konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 20m
3. oc login can succeed:
    $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
    Please visit the following URL in your browser: http://localhost:8080
    Logged into "https://a4af9764....elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.
    You don't have any projects. Contact your system administrator to request a project.
4. Update HC by changing claim: email to claim: sub:
    $ oc edit hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG
    ...
    username:
      claim: sub
    ...
   Update is picked up:
    $ oc get authentication.config cluster -o yaml
    ...
    spec:
      oauthMetadata:
        name: tested-oauth-meta
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: sub
            prefix:
              prefixString: 'oidc-user-test:'
            prefixPolicy: Prefix
        issuer:
          audiences:
          - 76863fb1-xxxxxx
          issuerCertificateAuthority:
            name: ""
          issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
        name: microsoft-entra-id
        oidcClients:
        - clientID: 76863fb1-xxxxxx
          clientSecret:
            name: console-secret
          componentName: console
          componentNamespace: openshift-console
      serviceAccountIssuer: https://xxxxxx.s3.us-east-2.amazonaws.com/hypershift-ci-267402
      type: OIDC
    status:
      oidcClients:
      - componentName: console
        componentNamespace: openshift-console
        conditions:
        - lastTransitionTime: "2024-02-28T10:51:17Z"
          message: ""
          reason: OIDCConfigAvailable
          status: "False"
          type: Degraded
        - lastTransitionTime: "2024-02-28T10:51:17Z"
          message: ""
          reason: OIDCConfigAvailable
          status: "False"
          type: Progressing
        - lastTransitionTime: "2024-02-28T10:51:17Z"
          message: ""
          reason: OIDCConfigAvailable
          status: "True"
          type: Available
        currentOIDCClients:
        - clientID: 76863fb1-xxxxxx
          issuerURL: https://login.microsoftonline.com/xxxxxxxx/v2.0
          oidcProviderName: microsoft-entra-id
4. Check pods again:
    $ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
    ...
    kube-apiserver-596dcb97f-n5nqn 5/5 Running 0 108m
    kube-apiserver-596dcb97f-7cn9f 5/5 Running 0 106m
    kube-apiserver-596dcb97f-2rskz 5/5 Running 0 104m
    openshift-apiserver-c9455455c-t7prz 3/3 Running 0 102m
    openshift-apiserver-c9455455c-jrwdf 3/3 Running 0 101m
    openshift-apiserver-c9455455c-npvn5 3/3 Running 0 100m
    konnectivity-agent-7bfc7cb9db-bgrsv 1/1 Running 0 100m
    cluster-version-operator-675745c9d6-5mv8m 1/1 Running 0 100m
    hosted-cluster-config-operator-559644d45b-4vpkq 1/1 Running 0 100m
    konnectivity-agent-7bfc7cb9db-hjqlf 1/1 Running 0 99m
    konnectivity-agent-7bfc7cb9db-gl9b7 1/1 Running 0 99m
   No new pods renewed.
5. Check login again, it does not use "sub", still use "email":
    $ rm -rf ~/.kube/cache/
    $ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
    Please visit the following URL in your browser: http://localhost:8080
    Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer.
    You don't have any projects. Contact your system administrator to request a project.
    $ cat ~/.kube/cache/oc/* | jq -r '.id_token' | jq -R 'split(".") | .[] | @base64d | fromjson'
    ...
    {
      ...
      "email": "xxia@redhat.com",
      "groups": [
        ...
      ],
      ...
      "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M",
      ...
Actual results:
Steps 4 ~ 5: after editing the HC field value from "claim: email" to "claim: sub", even though `oc get authentication cluster -o yaml` shows that the edit has propagated:
1> Pods such as kube-apiserver are not renewed.
2> After cleaning up ~/.kube/cache, an `oc login ...` relogin still prints 'Logged into "https://xxxxxxx.elb.ap-southeast-1.amazonaws.com:6443" as "oidc-user-test:xxia@redhat.com" from an external oidc issuer', i.e. it still uses the old claim "email" as the user name instead of the new claim "sub".
Expected results:
Steps 4 ~ 5: Pods like kube-apiserver pods should renew after HC editing that changes user claim. The login should print that the new claim is used as user name.
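To make the expected result concrete, the following is a minimal Python sketch of how a user name would be derived from an ID token once the username claim mapping changes. It is an illustration only, not the kube-apiserver implementation: the helper names (`decode_jwt_payload`, `mapped_username`) and the unsigned sample token are hypothetical, and only the claim names, claim values, and the `oidc-user-test:` prefix come from the report above.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore the padding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def mapped_username(claims: dict, claim: str, prefix: str) -> str:
    """Apply a username claim mapping that uses a fixed prefix string."""
    return prefix + claims[claim]


def _b64url(obj: dict) -> str:
    """Encode a dict as an unpadded base64url JSON segment (sample data only)."""
    raw = json.dumps(obj).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical unsigned token carrying the two claims shown in the report.
sample_claims = {
    "email": "xxia@redhat.com",
    "sub": "EEFGfgPXr0YFw_ZbMphFz6UvCwkdFS20MUjDDLdTZ_M",
}
sample_token = ".".join([_b64url({"alg": "none"}), _b64url(sample_claims), ""])

claims = decode_jwt_payload(sample_token)
# What the login currently prints (old mapping, claim: email):
print(mapped_username(claims, "email", "oidc-user-test:"))
# What the login is expected to print after the edit (claim: sub):
print(mapped_username(claims, "sub", "oidc-user-test:"))
```

With the edited mapping, the expected user name is the prefix followed by the value of "sub" rather than the email address.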
Additional info:
As a (user persona), I want to be able to:
so that I can achieve
Description of criteria:
Detail about what is specifically not being delivered in the story
This requires/does not require a design proposal.
This requires/does not require a feature gate.
We need the update to the Authentication API so that we can detect the authentication type configured for the cluster and deploy, or not deploy, the OAuth components based on that type.
When the internal oauth-server and oauth-apiserver are removed and replaced with an external OIDC issuer (like Azure AD), the console must work for human users of the external OIDC issuer.
An end user can use the OpenShift console without a notable difference in experience. This must eventually work on both HyperShift and standalone, but HyperShift is the first priority if it impacts delivery.
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
High-level list of items that are out of scope. Initial completion during Refinement status.
Provide any additional context is needed to frame the feature. Initial completion during Refinement status.
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
Console server code needs refactoring in order to move forward with backend changes for introducing auth against external OIDC.
AC: Move auth config to its own module
User API may not always be available. K8S now has a stable API to query for user information - https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3325-self-subject-attributes-review-api. See if it can be used and replace all `user/~` calls with it.
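As a rough illustration of that replacement, the sketch below builds a SelfSubjectReview request body and extracts the user information from a response. The request path and the `status.userInfo` fields follow the upstream `authentication.k8s.io/v1` API; the helper names and the sample response are hypothetical, and the sketch performs no network call.

```python
# Path of the SelfSubjectReview API in authentication.k8s.io/v1.
SELF_SUBJECT_REVIEW_PATH = "/apis/authentication.k8s.io/v1/selfsubjectreviews"


def build_self_subject_review() -> dict:
    """Body to POST to SELF_SUBJECT_REVIEW_PATH; the request needs no spec."""
    return {"apiVersion": "authentication.k8s.io/v1", "kind": "SelfSubjectReview"}


def parse_user_info(response: dict) -> dict:
    """Extract the attributes a `user/~` caller would typically need."""
    info = response.get("status", {}).get("userInfo", {})
    return {
        "username": info.get("username", ""),
        "uid": info.get("uid", ""),
        "groups": info.get("groups", []),
    }


# Hypothetical response for a user authenticated by an external OIDC issuer.
sample_response = {
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "SelfSubjectReview",
    "status": {
        "userInfo": {
            "username": "oidc-user-test:xxia@redhat.com",
            "groups": ["system:authenticated"],
        }
    },
}

user = parse_user_info(sample_response)
print(user["username"], user["groups"])
```

Because the review is answered by the kube-apiserver itself, this approach does not depend on the OpenShift User API being present.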
When the console is using an external oidc token the users and groups sections of the UI are no longer relevant, and we need not render them.
Acceptance criteria:
The web console should behave like a generic OIDC client when requesting tokens from an OIDC provider.
The console needs to be able to authenticate against an external OIDC IDP. For that, the console-operator needs to configure it accordingly.
AC:
Enable a "Break Glass Mechanism" in ROSA (Red Hat OpenShift Service on AWS) and other OpenShift cloud-services in the future (e.g., ARO and OSD) to provide customers with an alternative method of cluster access via short-lived certificate-based kubeconfig when the primary IDP (Identity Provider) is unavailable.
Create a new short-lived signer CA that signs a cluster-admin kubeconfig we provide to the customer upon request.
The CA must be trusted by the KAS and included in the CA bundle along with the CA that will sign longer lived cert-based creds like those used by SRE.
TBD: should we create the signer at the point of kubeconfig request from the customer? Or should we always have the signer active through periodic rotation?
On-demand signer:
Always valid signer with rotation:
To align with the 4.16 release, dependencies need to be updated to 1.29. This should be done by rebasing/updating as appropriate for the repository
This story relates to this PR https://github.com/openshift/machine-config-operator/pull/4275
A new PR has been opened to investigate the issues found in the original PR (this is the link to the new PR): https://github.com/openshift/machine-config-operator/pull/4306
The original PR exceeded the watch request limits when merged. When discovered, the CNTO team needed to revert it. (see https://redhat-internal.slack.com/archives/C01CQA76KMX/p1711711538689689).
To investigate if exceeding the watch request limit was introduced from the API bump and its associated changes, or the kubeconfig changes, an additional PR was opened just for looking at removing the hardcoded values from the kubelet template, and payload tests were run against it: https://github.com/openshift/machine-config-operator/pull/4270. The payload tests passed, and it was concluded that the watch request limit issue was introduced in the portion of the PR that included the API bump and its associated changes.
It was discovered that the CNTO team was using an outdated form of openshift deps, so they were asked to bump. https://redhat-internal.slack.com/archives/CQNBUEVM2/p1712171079685139?thread_ts=1711712855.478249&cid=CQNBUEVM2
https://github.com/openshift/cluster-node-tuning-operator/pull/990 was opened in the past to address the kube bump (this just merged), and https://github.com/openshift/cluster-node-tuning-operator/pull/1022 was opened as well (still open).
CURRENT STATUS: waiting for https://github.com/openshift/cluster-node-tuning-operator/pull/1022 to merge so we can rerun payload tests against the revert PR open.
As an MCO developer, I want to pick up the openshift/kubernetes updates for the 1.29 k8s rebase so that the MCO tracks the same k8s version as the rest of the OpenShift 1.29 cluster.
Follow the rebase doc[1] and update the spreadsheet[2] that tracks the required commits to be cherry-picked. Rebase the o/k repo with the "merge=ours" strategy as mentioned in the rebase doc.
Save the last commit id in the spreadsheet for future references.
Update the rebase doc if required.
[1] https://github.com/openshift/kubernetes/blob/master/REBASE.openshift.md
[2] https://docs.google.com/spreadsheets/d/10KYptJkDB1z8_RYCQVBYDjdTlRfyoXILMa0Fg8tnNlY/edit#gid=1957024452
Prev. Ref:
https://github.com/openshift/kubernetes/pull/1646
OCP 4 clusters still maintain pinned boot images. We have numerous clusters installed that have boot media pinned to first boot images as early as 4.1. In the future these boot images may not be certified by the OEM and may fail to boot on updated datacenter or cloud hardware platforms. These "pinned" boot images should be updateable so that customers can avoid this problem and better still scale out nodes with boot media that matches the running cluster version.
The boot image references are currently saved in the machineset by the OpenShift installer and are unmanaged thereafter. This machineset object is not updated on an upgrade, so any node scaled up using it will boot with the original “install” boot image.
The “new” boot image references are available in the configmap/coreos-bootimages in the MCO namespace. The PR that implemented this is basically a CVO manifest that pulls from this file in the installer binary; hence, the references are updated on an upgrade. They can also be printed to the console with the following installer command: ./openshift-install coreos print-stream-json.
Implementing this portion should be as simple as iterating through each machineset and updating the disk image by cross-referencing the configmap with the architecture, region and platform used in the machineset. This is where the installer figures out the boot image during an install, so we could model a bit after this.
It looks like we have Machine API objects for every platform-specific providerSpec (formerly called providerConfig) we support here. We'd still have to special-case the actual image/AMI portion of this, but we should be able to leverage some of the work done in the installer (to generate machinesets, for example, GCP) to understand how the image reference is stored for every platform.
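The cross-referencing step described above could look roughly like the following Python sketch for GCP. It is only an illustration under stated assumptions: the stream layout (`architectures.<arch>.images.gcp` with `project` and `name`) follows the CoreOS stream metadata format, the providerSpec path and the `boot`/`image` disk fields are assumed from GCP machinesets generated by the installer, and the function names and sample values are hypothetical.

```python
import copy


def gcp_image_from_stream(stream: dict, arch: str) -> str:
    """Build the GCP image reference for an architecture from stream metadata."""
    gcp = stream["architectures"][arch]["images"]["gcp"]
    return f"projects/{gcp['project']}/global/images/{gcp['name']}"


def patch_gcp_machineset(machineset: dict, stream: dict, arch: str = "x86_64"):
    """Return (patched copy, changed flag) with boot disks set to the new image."""
    new_image = gcp_image_from_stream(stream, arch)
    patched = copy.deepcopy(machineset)
    disks = patched["spec"]["template"]["spec"]["providerSpec"]["value"]["disks"]
    changed = False
    for disk in disks:
        # Only boot disks carry the boot image; leave data disks untouched.
        if disk.get("boot") and disk.get("image") != new_image:
            disk["image"] = new_image
            changed = True
    return patched, changed


# Hypothetical sample data; real values come from the configmap and the cluster.
stream = {
    "architectures": {
        "x86_64": {"images": {"gcp": {"project": "rhcos-cloud", "name": "rhcos-new"}}}
    }
}
machineset = {
    "metadata": {"name": "worker-a"},
    "spec": {"template": {"spec": {"providerSpec": {"value": {"disks": [
        {"boot": True, "image": "projects/rhcos-cloud/global/images/rhcos-old"}
    ]}}}}},
}

patched, changed = patch_gcp_machineset(machineset, stream)
print(changed, patched["spec"]["template"]["spec"]["providerSpec"]["value"]["disks"][0]["image"])
```

Returning a copy plus a `changed` flag keeps the function idempotent, so a controller could skip the patch call when the machineset already points at the current image.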
Done when:
For MVP, the goal is to
This will pick up stories left off from the initial Tech Preview(Phase 1): https://issues.redhat.com/browse/MCO-589
A ValidatingAdmissionPolicy should be implemented (via an MCO manifest) for changes to this new API object, so that the feature is not turned on in unsupported platforms. The only platform currently supported is GCP. The ValidatingAdmissionPolicy is kube native and is behind its own feature gate, so this will have to be checked while applying these manifests. Here is what the YAML of these manifests would look like:
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: "managed-bootimages-platform-check"
spec:
  failurePolicy: Fail
  paramKind:
    apiVersion: config.openshift.io/v1
    kind: Infrastructure
  matchConstraints:
    resourceRules:
    - apiGroups: ["operator.openshift.io"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["MachineConfiguration"]
  validations:
  - expression: "has(object.spec.ManagedBootImages) && param.status.platformStatus.Type == `GCP`"
    message: "This feature is only supported on these platforms: GCP"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "managed-bootimages-platform-check-binding"
spec:
  policyName: "managed-bootimages-platform-check"
  validationActions: [Deny]
  paramRef:
    name: "cluster"
    parameterNotFoundAction: "Deny"
We'll want to add some tests to make sure that managing boot images hasn't broken our existing functionality and that our new feature works. Proposed flow:
1/30/24: Updated based on enhancement discussions
The MachineSetBootImage Controller will create an alert if there are excessive failures to patch a MachineSet.
This is the MCO side of the changes. Once the API PR lands, the MSBIC should start watching for the new API object.
It is also important to note that MachineSets that have an ownerReference should not be opted in to this mechanism, even if they are opted in via the API. See discussion here: https://github.com/openshift/enhancements/pull/1496#discussion_r1463386593
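The ownerReference rule can be sketched as a small eligibility check. This is a hedged Python illustration, not the MSBIC code: the function name `is_eligible` and the `opted_in` flag (standing in for whatever the new API object selects) are hypothetical.

```python
def is_eligible(machineset: dict, opted_in: bool) -> bool:
    """A MachineSet is managed only if opted in AND it has no ownerReferences."""
    owners = machineset.get("metadata", {}).get("ownerReferences") or []
    return opted_in and not owners


# Hypothetical MachineSets: one standalone, one owned by another controller.
standalone = {"metadata": {"name": "worker-a"}}
owned = {"metadata": {"name": "worker-b", "ownerReferences": [{"kind": "Example"}]}}

print(is_eligible(standalone, opted_in=True))
print(is_eligible(owned, opted_in=True))
```

The owner check takes precedence over the API opt-in, which matches the note above: an owned MachineSet is skipped even when the API selects it.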
Done when:
Update 3/26/24 - Moved ValidatingAdmissionPolicy bit into a separate story as that got a bit more involved.
This feature is dedicated to enhancing data security and implementing encryption best practices across control planes, etcd, and nodes for HyperShift with Azure. The objective is to ensure that all sensitive data, including secrets, is encrypted, thereby safeguarding against unauthorized access and ensuring compliance with data protection regulations.
Expose and propagate input for KMS secret encryption, similar to what we do in AWS.
See related discussion:
https://redhat-internal.slack.com/archives/CCV9YF9PD/p1696950850685729
This feature focuses on the optimization of resource allocation and image management within NodePools. This will include enabling users to specify resource groups at NodePool creation, integrating external DNS support, ensuring Cluster API (CAPI) and other images are sourced from the payload, and utilizing Image Galleries for Azure VM creation.
As a user of HyperShift on Azure, I would like the reconciliation process for the cloud controller manager to run the Azure external provider like we do for AWS and other platforms so that the NodePool nodes will join the HostedCluster.
Any other issues with Azure HostedClusters discovered during development.
As a user of HyperShift, I want the cluster API Azure (CAPZ) image to come from the OCP release image rather than being hardcoded in the HyperShift code so that I can always use the latest CAPZ image related to the OCP release image.
The CAPZ image comes from the OCP release image.
N/A
As a cluster service provider / consumer I want the hosted control plane endpoints to be resolvable through a known dns zone
Acceptance Criteria:
Stop generating long-lived service account tokens. Long-lived service account tokens are currently generated in order to then create an image pull secret for the internal image registry. This feature calls for using the TokenRequest API to generate a bound service account token for use in the image pull secret.
Use TokenRequest API to create image pull secrets.
Performance benefits:
One less secret is created per service account. This will result in at least three fewer secrets generated per namespace.
Security benefits:
Long-lived tokens are no longer recommended, as they present a possible security risk.
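As a rough sketch of the approach, the Python below builds a TokenRequest body and turns a returned bound token into dockercfg-style pull secret data. The TokenRequest shape follows the upstream `authentication.k8s.io/v1` API (posted to the service account's `token` subresource); the helper names, the registry host, the audience, the `<token>` user name, and the sample token are assumptions for illustration, and no API call is made.

```python
import base64


def build_token_request(audiences: list, expiration_seconds: int) -> dict:
    """Body for POST /api/v1/namespaces/{ns}/serviceaccounts/{name}/token."""
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": {
            "audiences": list(audiences),
            "expirationSeconds": expiration_seconds,
        },
    }


def build_dockercfg(registry_hosts: list, token: str, username: str = "<token>") -> dict:
    """Build dockercfg-style auth entries that use the bound token as password."""
    auth = base64.b64encode(f"{username}:{token}".encode()).decode()
    return {
        host: {"username": username, "password": token, "auth": auth}
        for host in registry_hosts
    }


# Hypothetical values: audience, registry host, and token are placeholders.
request_body = build_token_request(["openshift"], 3600)
dockercfg = build_dockercfg(
    ["image-registry.openshift-image-registry.svc:5000"], "bound-token-value"
)
print(request_body["spec"])
print(sorted(dockercfg))
```

Because bound tokens expire, whichever controller owns the pull secret would need to refresh it before `expirationSeconds` elapses; that rotation logic is outside this sketch.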
Requirements (aka. Acceptance Criteria):
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
Telecommunications providers look to displace Physical Network Functions (PNFs) with modern Virtual Network Functions (VNFs) at the Far Edge. Single Node OpenShift, as the CaaS layer in the vRAN vDU architecture, must achieve a higher standard with regard to OpenShift upgrade speed and efficiency, in comparison to PNFs.
Telecommunications providers currently deploy Firmware-based Physical Network Functions (PNFs) in their RAN solutions. These PNFs can be upgraded quickly due to their monolithic nature and image-based download-and-reboot upgrades. Furthermore they often have the ability to retry upgrades and to rollback to the previous image if the new image fails. These Telcos are looking to displace PNFs with virtual solutions, but will not do so unless the virtual solutions have comparable operational KPIs to the PNFs.
Service (vDU) Downtime is the time when the CNF is not operational and therefore no traffic is passing through the vDU. This has a significant impact, as it degrades the customer’s service (5G->4G) or causes an outright service outage. These disruptions are scheduled into Maintenance Windows (MW), but the Telecommunications Operator's primary goal is to keep service running, so getting vRAN solutions with OpenShift to near PNF-like Service Downtime is and always will be a primary requirement.
Upgrading OpenShift is only one of many operations that occur during a Maintenance Window. Reducing the CaaS upgrade duration is meaningful to many teams within a Telecommunications Operator's organization, as this duration fits into a larger set of activities that put pressure on the duration time for Red Hat software. OpenShift must reduce the upgrade duration time significantly to compete with existing PNF solutions.
As mentioned above, the Service Downtime disruption duration must be as small as possible, and this includes when there are failures. Hardware failures fall into a category called Break+Fix and are covered by TELCOSTRAT-165. Software failures must be detected and remediation must occur.
Detection includes monitoring the upgrade for stalls and failures and remediation would require the ability to rollback to the previously well-known-working version, prior to the failed upgrade.
The OpenShift product support terms are too short for Telco use cases, in particular vRAN deployments. The risk of Service Downtime drives Telecommunications Operators to a certify-deploy-and-then-don’t-touch model. One specific request from our largest Telco Edge customer is for 4 years of support.
These longer support needs drive a misalignment with the EUS->EUS upgrade path and drive the requirement that the Single Node OpenShift deployment can be upgraded from OCP X.y.z to any future [X+1].[y+1].[z+1], where X+1 and y+1 are decided by the Telecommunications Operator depending on timing and the desired feature set, and z+1 is determined through Red Hat, vDU vendor and custom maintenance and engineering validation.
Red Hat is challenged with improving multiple OpenShift Operational KPIs by our telecommunications partners and customers. Improved Break+Fix is tracked in TELCOSTRAT-165 and improved Installation is tracked in TELCOSTRAT-38.
Whatever methodology achieves the above requirements must ensure that the customer has a pleasant experience via RHACM and Red Hat GitOps. Red Hat’s current install and upgrade methodology is via RHACM and any new technologies used to improve Operational KPIs must retain the seamless experience from the cluster management solution. For example, after a cluster is upgraded it must look the same to a RHACM Operator.
Whatever methodology achieves the above requirements must ensure that a technician troubleshooting a Single Node OpenShift deployment has a pleasant experience. All commands issued on the node must return output as it would before performing an upgrade.
Run TuneD in a container in one-shot mode and read the output kernel arguments to apply them using a MachineConfig (MC).
This would be run in the bootstrap procedure of the Openshift Installer, just before the MachineConfigOperator(MCO) procedure here
Initial considerations: https://docs.google.com/document/d/1zUpcpFUp4D5IM4GbM4uWbzbjr57h44dS0i4zP-hek2E/edit
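A minimal Python sketch of the idea follows: read the kernel arguments that a one-shot TuneD run produced and wrap them in a MachineConfig. The `TUNED_BOOT_CMDLINE="..."` input format is an assumption about how the arguments are handed over, and the helper names, the MachineConfig name, and the sample arguments are hypothetical; the MachineConfig fields (`spec.kernelArguments` and the role label) follow the MCO API.

```python
def parse_bootcmdline(text: str) -> list:
    """Extract kernel arguments from a line of the form TUNED_BOOT_CMDLINE="..."."""
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("TUNED_BOOT_CMDLINE="):
            value = line.split("=", 1)[1]
            # Drop the surrounding quotes, then split on whitespace.
            return value.strip().strip('"').split()
    return []


def build_machine_config(name: str, role: str, kernel_args: list) -> dict:
    """Wrap kernel arguments in a MachineConfig for the given pool role."""
    return {
        "apiVersion": "machineconfiguration.openshift.io/v1",
        "kind": "MachineConfig",
        "metadata": {
            "name": name,
            "labels": {"machineconfiguration.openshift.io/role": role},
        },
        "spec": {"kernelArguments": list(kernel_args)},
    }


# Hypothetical output of a one-shot TuneD run.
sample_output = 'TUNED_BOOT_CMDLINE="skew_tick=1 nohz=on isolcpus=2-7"\n'

args = parse_bootcmdline(sample_output)
mc = build_machine_config("50-tuned-kargs", "master", args)
print(mc["spec"]["kernelArguments"])
```

Generating the MachineConfig at bootstrap time means the arguments would already be part of the rendered configuration when the MCO first runs, which is the motivation stated above.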
A systemd service that runs on the first boot of a golden image and configures the following:
1. Networking (the internal IP address requires special attention)
2. Update the hostname (MGMT-15775)
3. Execute recert (regenerate certs, cluster name and base domain, MGMT-15533)
4. Start kubelet
5. Apply the personalization info:
If the answer is "yes", please make sure to check the corresponding option.
The following features depend on this functionality:
Because we want to support IBU with a single IP, we should change the dnsmasq and force-DNS configurations for SNO in order to support an IP change.
In the IBI and IBU flows, we need a way to change the nodeip-configuration hint file without a reboot and before the MCO even starts. To keep the MCO happy, we need to remove this file from its management; to do that, we will stop using a MachineConfig for it and move to Ignition.
DPDK applications require dedicated CPUs that are isolated from any preemption (other processes, kernel threads, interrupts). This can be achieved with the “static” policy of the CPU manager: the container resources need to include an integer number of CPUs of equal value in “limits” and “requests”. For instance, to get six exclusive CPUs:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu: "6"
      requests:
        cpu: "6"
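The rule stated above (an integer CPU count with equal requests and limits) can be checked with a short Python sketch. It is a simplification for illustration: the function names are hypothetical, and the real static policy additionally requires the pod to be in the Guaranteed QoS class, which this check does not model.

```python
def cpu_to_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "6", "1.5" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def gets_exclusive_cpus(resources: dict) -> bool:
    """True if CPU requests equal limits and are a whole, non-zero CPU count."""
    request = resources.get("requests", {}).get("cpu")
    limit = resources.get("limits", {}).get("cpu")
    if request is None or limit is None:
        return False
    req_m, lim_m = cpu_to_millis(request), cpu_to_millis(limit)
    return req_m == lim_m and req_m > 0 and req_m % 1000 == 0


# The container from the example above, plus a fractional counter-example.
six_cpus = {"limits": {"cpu": "6"}, "requests": {"cpu": "6"}}
fractional = {"limits": {"cpu": "500m"}, "requests": {"cpu": "500m"}}

print(gets_exclusive_cpus(six_cpus))
print(gets_exclusive_cpus(fractional))
```

A fractional request such as "500m" stays in the shared pool, which is why a slow-path thread cannot simply ask for a small slice alongside its dedicated CPUs.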
The six CPUs are dedicated to that container. However, non-trivial (that is, real) DPDK applications do not use all of those CPUs: at least one CPU always runs a slow path that processes configuration and prints logs (among the DPDK coding rules: no syscall in PMD threads, or you are in trouble). Even the DPDK PMD drivers and core libraries include pthreads that are intended to sleep; they are infrastructure pthreads that process link change interrupts, for instance.
Can we envision going with two processes, one with isolated cores and one with the slow-path ones, so that we can have two containers? Unfortunately not: a multi-process design, where only dedicated pthreads would run in a process, is not an option, as DPDK multi-process is being deprecated upstream and never gained adoption because it never worked properly. Fixing it and changing the DPDK architecture to systematically have two processes is absolutely not possible within a year, and would require all DPDK applications to be rewritten. Given that the first and current multi-process implementation is a failure, nothing guarantees that a second one would be successful.
The slow-path CPUs only consume a fraction of a real CPU and can safely run on the “shared” CPU pool of the CPU Manager. However, container specifications do not allow requesting two kinds of CPUs, for instance:
spec:
  containers:
  - name: CNF
    image: myCNF
    resources:
      limits:
        cpu_dedicated: "4"
        cpu_shared: "20m"
      requests:
        cpu_dedicated: "4"
        cpu_shared: "20m"
Why do we care about allocating one extra CPU per container?
Let’s take a realistic example, based on a real RAN CNF: running 6 containers with dedicated CPUs on a worker node, with a slow Path requiring 0.1 CPUs means that we waste 5 CPUs, meaning 3 physical cores. With real life numbers:
Intel public CPU price per core is around 150 US$, not even taking into account the ecological aspect of the waste of (rare) materials and the electricity and cooling…
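The waste estimate above can be reproduced with a few lines of Python. The container count, slow-path load, and price per core come from the text; the assumption that one physical core exposes two CPUs (hyper-threading) is ours and is what turns roughly five wasted CPUs into three physical cores.

```python
import math

containers = 6            # containers with dedicated CPUs on the worker node
slow_path_cpu = 0.1       # CPU fraction actually used by each slow path
threads_per_core = 2      # assumption: two hardware threads per physical core
price_per_core_usd = 150  # approximate public price per core, from the text

# Each container burns a full dedicated CPU for a slow path that needs 0.1 CPU.
wasted_cpus = containers * (1 - slow_path_cpu)
wasted_cores = math.ceil(wasted_cpus / threads_per_core)
wasted_usd = wasted_cores * price_per_core_usd

print(round(wasted_cpus, 1), wasted_cores, wasted_usd)
```

Under these assumptions the node gives up about 5.4 CPUs, rounded to three physical cores, or roughly 450 US$ of silicon per worker node before power and cooling are counted.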
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This issue has been addressed lately by OpenStack.
N/A
The feature enablement is done through the performance profile.
We should follow what is described in the EP (https://github.com/openshift/enhancements/pull/1396) and add all the bits and bytes that are needed in NTO for the feature activation.
In order to protect the operator (and the cluster in general) from Tech Preview (TP) features, we should add feature gate support under NTO.
We need to add support to Kubelet to advertise the shared CPUs as `openshift.io/enabled-shared-cpus` through extended resources.
This should be off by default and only activated when a configuration file is being supplied.
We need to extend the node admission plugin to support the shared cpus.
The admission should provide the following functionalities:
1. If a user specifies more than a single `openshift.io/enabled-shared-cpus` resource, it rejects the pod request with an error explaining to the user how to fix the pod spec.
2. It adds an annotation `cpu-shared.crio.io` that will be used to tell the runtime that shared cpus were requested.
For every container requested for shared cpus, it adds an annotation with the following scheme:
`cpu-shared.crio.io/<container name>`
Example of how it's done for core pinning: https://github.com/openshift/kubernetes/commit/04ff5090bae1cb181a2464696adde8709cdd0a93
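The two admission behaviours listed above can be sketched in Python as follows. This is an illustration, not the admission plugin itself: the resource name and the `cpu-shared.crio.io/<container name>` annotation scheme come from the text, while the function name, the error wording, and the annotation value ("enable") are assumptions.

```python
SHARED_CPU_RESOURCE = "openshift.io/enabled-shared-cpus"
ANNOTATION_PREFIX = "cpu-shared.crio.io"


def admit_shared_cpus(pod: dict) -> dict:
    """Validate shared-CPU requests and return the pod's updated annotations."""
    annotations = dict(pod.get("metadata", {}).get("annotations", {}))
    for container in pod["spec"]["containers"]:
        resources = container.get("resources", {})
        requested = max(
            int(resources.get("limits", {}).get(SHARED_CPU_RESOURCE, 0)),
            int(resources.get("requests", {}).get(SHARED_CPU_RESOURCE, 0)),
        )
        if requested > 1:
            # Behaviour 1: reject and explain how to fix the pod spec.
            raise ValueError(
                f"container {container['name']!r} requests {requested} "
                f"{SHARED_CPU_RESOURCE}; set the value to 1"
            )
        if requested == 1:
            # Behaviour 2: per-container annotation; the value is an assumption.
            annotations[f"{ANNOTATION_PREFIX}/{container['name']}"] = "enable"
    return annotations


# Hypothetical pod: one container asks for shared CPUs, one does not.
pod = {
    "metadata": {"name": "cnf"},
    "spec": {"containers": [
        {"name": "dpdk", "resources": {"limits": {"cpu": "4", SHARED_CPU_RESOURCE: "1"}}},
        {"name": "sidecar", "resources": {"limits": {"cpu": "1"}}},
    ]},
}

print(admit_shared_cpus(pod))
```

Keying the annotation by container name lets the runtime tell which containers in a multi-container pod should be given access to the shared CPU set.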
bump cluster-config-operator to pull mixed-cpus feature-gate api
Telecommunications providers continue to deploy OpenShift at the Far Edge. The acceleration of this adoption and the nature of existing Telecommunication infrastructure and processes drive the need to improve OpenShift provisioning speed at the Far Edge site and the simplicity of preparation and deployment of Far Edge clusters, at scale.
A list of specific needs or objectives that a Feature must deliver to satisfy the Feature. Some requirements will be flagged as MVP. If an MVP gets shifted, the feature shifts. If a non MVP requirement slips, it does not shift the feature.
requirement | Notes | isMvp? |
Telecommunications Service Provider Technicians will be rolling out OCP w/ a vDU configuration to new Far Edge sites, at scale. They will be working from a service depot where they will pre-install/pre-image a set of Far Edge servers to be deployed at a later date. When ready for deployment, a technician will take one of these generic-OCP servers to a Far Edge site, enter the site specific information, wait for confirmation that the vDU is in-service/online, and then move on to deploy another server to a different Far Edge site.
Retail employees in brick-and-mortar stores will install SNO servers and it needs to be as simple as possible. The servers will likely be shipped to the retail store, cabled and powered by a retail employee and the site-specific information needs to be provided to the system in the simplest way possible, ideally without any action from the retail employee.
Q: how challenging will it be to support multi-node clusters with this feature?
Question | Outcome |
To give Telco Far Edge customers as much of the product support lifespan as possible, we need to ensure that OCP releases are "telco ready" when the OCP release is GA.
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
No documentation required
Implement the rteval test in the openshift-test binary under the openshift/nodes/realtime test suite
Feature Overview
Requirement | Notes | isMvp? |
---|---|---|
CI - MUST be running successfully with test automation | This is a requirement for ALL features. | YES |
Release Technical Enablement | Provide necessary release enablement details and documents. | YES |
This Section:
This Section: What does the person writing code, testing, documenting need to know? What context can be provided to frame this feature.
Questions to be addressed:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
The goal of this epic is to guarantee that all pods running within the ACM (Advanced Cluster Management) cluster adhere to Kubernetes Security Context Constraints (SCC). The implementation of a comprehensive SCC compliance checking system will proactively maintain a secure and compliant environment, mitigating security risks.
Ensuring SCC compliance is critical for the security and stability of a Kubernetes cluster.
A customer who is responsible for overseeing the operations of their cluster, faces the challenge of maintaining a secure and compliant Kubernetes environment. The organization relies on the ACM cluster to run a variety of critical workloads across multiple namespaces. Security and compliance are top priorities, especially considering the sensitive nature of the data and applications hosted in the cluster.
As an ACM admin, I want to add Kubernetes Security Context Constraints (SCC) V2 options to the component's resource YAML configuration to ensure that the Pod runs with the 'readonlyrootfilesystem' and 'privileged' settings, in order to enhance the security and functionality of our application.
In the resource config YAML, we need to add the following context:
securityContext:
  privileged: false
  readOnlyRootFilesystem: true
Affected resources:
OKD users should be able to use the agent-based install method to install OKD FCOS clusters.
This should mostly work already, but there are some known differences between RHCOS and FCOS that we rely on (as evidenced by people trying to use the FCOS ISO by accident and it failing). Specifically, I vaguely recall that the semodule command used by the selinux.service may be missing from FCOS.
Ultimately we need CI testing of OKD. This may help alert us to upcoming issues in RHCOS before we encounter them in OCP.
Some of the scripts generated by the create ignition-configs command for OKD may need to distinguish between the case where SNO is being managed and the case where the agent-based installer is in use.
This epic tracks any part of our codebase / solutions we implemented taking shortcuts.
Whenever a shortcut is taken, we should add a story here so that we do not forget to improve it in a safer and more maintainable way.
Maintainability and debuggability, and fighting technical debt in general, are critical to keeping velocity and ensuring overall high quality.
https://issues.redhat.com/browse/CNF-796
https://issues.redhat.com/browse/CNF-1479
https://issues.redhat.com/browse/CNF-2134
https://issues.redhat.com/browse/CNF-6745
https://issues.redhat.com/browse/CNF-8036
An epic we can duplicate for each release to ensure we have a place to catch things we ought to be doing regularly but can tend to fall by the wayside.
As a developer, I want to switch from using Chrome to Electron browser in e2e tests in the console repo. There are multiple advantages - Faster testing and native support from Cypress.
Search repos: https://github.com/search?q=org%3Aopenshift+--browser+%24%7BBRIDGE_E2E_BROWSER_NAME%3A%3Dchrome%7D&type=code
AC:
As part of the spike to determine outdated plugins, the null-loader dev dependency is out of date and needs to be updated.
Acceptance criteria:
packages/console-shared/src/test-utils includes Protractor code that is shared among static packages. Once all static packages have removed or migrated their Protractor tests, remove packages/console-shared/src/test-utils.
AC:
Remove
once all other Protractor tests have been removed or migrated
console-plugin-sdk includes https://github.com/openshift/console/tree/master/frontend/packages/console-plugin-sdk#integration-tests. We need to remove this once all Protractor tests have been migrated to Cypress or removed.
Update the console frontend builder image from Node.js version 14 to 18. This is necessary because some other dependencies require a higher version of Node.js.
Acceptance criteria:
Findings:
Updating Node.js from 14 to 18 in the console UI builder image and getting a clean run of pre-merge tests requires a workaround for a breaking change in the hashing algorithm in Node.js v17 and later. To fix the issue, we'll have to override the “md4” hashing algorithm, which was the default with previous versions of Node.js.
Reference to previous PRs to update tectonic-console-builder image.
https://github.com/openshift/console/pull/12828
https://github.com/openshift/release/pull/39443/files
`frontend/packages/ceph-storage-plugin` has been migrated to a dynamic plugin in an external repo, so all the code there is orphaned. Included in that orphaned code are Protractor tests that need to be removed in order to complete the removal of Protractor.
AC:
As part of the spike to determine outdated plugins, the husky dev dependency is out of date and needs to be updated.
Acceptance criteria:
Note: Migration guide from v4 to v8 - https://typicode.github.io/husky/migrating-from-v4.html
Given the mature state of the console, it would be good to determine which packages need to be updated, so that we do not reach a point where a package has to be updated urgently because of a CVE.
For that we should determine a list of top packages for console operator and create a story for each.
AC:
Cluster-version operator (CVO) manifests declaring a runlevel are supposed to use 0000_<runlevel>_<dash-separated-component>_<manifest_filename>, but since api#598 added 0000_10-helm-chart-repository.crd.yaml and api#1084 added 0000_10-project-helm-chart-repository.crd.yaml, those have diverged from that pattern, so the cluster-version operator will fail to parse their runlevel. They're still getting sorted into a runlevel around 10 by this code, but unless there are reasons that the CRD needs to be pushed early in an update, it gives the CVO the ability to parallelize the reconciliation with more sibling resources if you leave the runlevel prefix off in the API repository (after which these COPY lines would need a matching bump).
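The naming pattern can be illustrated with a minimal Python sketch. This is not the cluster-version operator's actual parser: the regular expression, the `parse_runlevel` helper, and the conforming file name are assumptions made for illustration. It only shows why a name that joins the runlevel and the component with a dash does not match `0000_<runlevel>_<dash-separated-component>_<manifest_filename>`.

```python
import re
from typing import Optional

# Hypothetical pattern for 0000_<runlevel>_<dash-separated-component>_<manifest_filename>.
# Illustration only; this is not the CVO's real parsing code.
MANIFEST_RE = re.compile(
    r"^0000_(?P<runlevel>\d+)_(?P<component>[a-z0-9]+(?:-[a-z0-9]+)*)_(?P<filename>.+)$"
)


def parse_runlevel(name: str) -> Optional[int]:
    """Return the runlevel encoded in a manifest file name, or None if it does not match."""
    match = MANIFEST_RE.match(name)
    return int(match.group("runlevel")) if match else None


# A conforming name (made up for this example) yields a runlevel.
print(parse_runlevel("0000_10_helm-chart-repository_01-crd.yaml"))
# The two names mentioned above put a dash after the runlevel, so nothing is parsed.
print(parse_runlevel("0000_10-helm-chart-repository.crd.yaml"))
print(parse_runlevel("0000_10-project-helm-chart-repository.crd.yaml"))
```

Under this assumed pattern, the two diverging names return `None`, which matches the description of the CVO failing to parse their runlevel.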
As part of the spike to determine outdated plugins, the i18next-parser dev dependency is out of date and needs to be updated.
Acceptance criteria:
With the migration to Cypress 10+, Cypress' config file changed. When migrating to the new config file format, Cypress automatically retained the existing plugins setup. We should update the config to import the plugins directly in to the config file and convert the file to TypeScript. See https://docs.cypress.io/guides/tooling/plugins-guide
In retrospect, we don't actually want to do this because https://github.com/openshift/console/blob/master/frontend/packages/integration-tests-cypress/plugins/index.js is shared with OLM, so importing directly would mean duplication. Additionally, there is no real benefit to converting to TypeScript, so I am reducing this story to simply removing the comments here and here.
AC: Update the Cypress config for
dev-console and knative-plugin static packages have Protractor tests that utilize the views in packages/operator-lifecycle-manager/integration-tests. Once the dev-console and knative-plugin Protractor tests have been removed or migrated, remove packages/operator-lifecycle-manager/integration-tests.
AC:
We are currently running Cypress v8.5.0, which is over two years old, so we should upgrade to a newer version to stay current. Recommended upgrade is v12.17.4.
Robb created a WIP PR to vet this upgrade and diagnose and fix any issues in console and OLM tests.
A few outstanding issues with that PR:
AC:
As a developer, I want to go through the e2e tests and remove PatternFly classnames being used as selectors so that future PatternFly upgrades do not break integration tests.
AC:
Remove or migrate Protractor tests to Cypress
AC:
Patternfly is releasing the new PF5 version. Due to the new version we should bump the version in console to identify any issues related to the new version, particularly:
Acceptance criteria (needs refinement)
The existing console.page/resource/details extension point results in an entirely blank page where everything comprising the details page has to be duplicated. Rather than needing to duplicate everything, another option would be to add an extension point to the existing DefaultDetailsPage ResourceSummary so that there is no need to recreate the entire details page, but additional ResourceSummary items could be added to the existing default. Other option would be to create a new extension.
AC:
We need to enable the storage for the v1 version of our ConsolePlugin CRD in the API repository. ConsolePlugin v1 CRD was added in CONSOLE-3077.
AC: Enable the storage for the v1 version of ConsolePlugin CRD and disable the storage for v1alpha1 version
Remove the code that was added through the ACM integration from all of the console's codebase repositories.
Since a decision was made to stop the ACM integration, we as a team decided that it would be better to remove the unused code in order to avoid any confusion or regressions.
Remove all multicluster components, hooks, and helpers from the frontend codebase.
Remove all multicluster-related code from the console operator repo.
AC:
Remove all multicluster-related code from the console backend
AC:
This story describes Phase 1 of using OpenShift Dynamic Plugin SDK in Console.
This story is focused on plugin build-time infrastructure.
Generated @openshift-console/dynamic-plugin-sdk-webpack package should be updated as follows:
Both Console and OpenShift Dynamic Plugin SDK should be updated to address any discrepancies.
AC:
Building CI Images has recently increased in duration, sometimes hitting 2 hours, which causes multiple problems:
More importantly, the build times have gotten to a point where OSBS is failing to build the installer due to timeouts, which is making it impossible for ART to deliver the product or critical fixes.
The installer-artifacts image depends on the installer image. It seems unnecessary to copy the x86 installer binary into both images. See if we can decouple.
Previous art: https://github.com/openshift/release/pull/38975
Acceptance criteria:
Create a new repo for the providers, build them into a container image and import the image in the installer container image.
Hopefully this will save resources and decrease build times for CI jobs in Installer PRs.
We're currently on etcd 3.5.9, with new CVEs and features implemented we want to rebase to the latest release.
Golang 1.20 update
In 3.5.9 we have spent significant time to update all active releases to use go 1.19. With 3.5.10 the default version will be 1.20 (ETCD-481). We need to figure out whether it makes sense to bump the image again or rely on the go-toolbox team to give us a patched 1.19 release for the time being.
There have not been any code changes that require us to use 1.20.
Rebase openshift/etcd to latest 3.5.10 upstream release.
Rebase openshift/etcd to latest 3.5.11 upstream release.
Also remove support for DOCKER_REGISTRY_SERVICE_HOST and DOCKER_REGISTRY_SERVICE_PORT.
Most of the core openshift sub-projects APIs live today in https://github.com/openshift/api/ repo .
Description:
As an MCO developer, it is easier to update and maintain APIs and CRDs when they are co-located with those of other core operators.
Most of the core openshift sub-project API and CRDs live today in centralized location https://github.com/openshift/api/ . This was done as part of enhancement https://github.com/openshift/enhancements/blob/master/enhancements/api-review/centralized-manifest-openapi-generation.md and makes sense for MCO as well.
Acceptance Criteria:
More context:
I recorded 2 failures during preparing for installation (with @itsoiref help),
Failure 1 - we made the image availability validation fail and started the installation.
Result: you get an informative notification. failure-1.mp4
Failure 2 - We made an install config override that makes the installer fail while generating the ignition.
Result: No notification (as described in the bug...). failure-2.mp4
Manage the effort for adding jobs for release-ocm-2.9 on assisted installer
https://docs.google.com/document/d/1WXRr_-HZkVrwbXBFo4gGhHUDhSO4-VgOPHKod1RMKng
Merge order:
Update the BUNDLE_CHANNELS in the Makefile in assisted-service and run bundle generation.
The external platform was created to allow cloud providers to supply their own integration components (cloud controller manager, etc.) without prior integration into openshift release artifacts. We need to support this new platform in assisted-installer in order to provide a user friendly way to enable such clusters, and to enable new-to-openshift cloud providers to quickly establish an installation process that is robust and will guide them toward success.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
the new API should look like (it should match the installer-config definition):
platform: {
type: external
external: {
platformName: oci
CloudControllerManager: External # in the future
}
}
When platformName is set to oci, the service must behave like "platform.type: oci".
CloudControllerManager is dependent on https://github.com/openshift/installer/pull/7457/files
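The rule that `platformName: oci` must behave like `platform.type: oci` can be sketched as follows. This is a minimal Python illustration, not the assisted-service implementation; the `effective_platform_type` helper and the dictionary shape are assumptions that simply mirror the API snippet above.

```python
def effective_platform_type(platform: dict) -> str:
    """Return the platform type the service should act on.

    Hypothetical helper: an external platform whose platformName is "oci"
    is treated exactly like platform.type == "oci".
    """
    platform_type = platform.get("type", "")
    if platform_type == "external":
        external = platform.get("external") or {}
        if external.get("platformName") == "oci":
            return "oci"
    return platform_type


# Request shaped like the API snippet above.
request = {"type": "external", "external": {"platformName": "oci"}}
print(effective_platform_type(request))
```

Any other `platformName` is left as a plain external platform in this sketch, which keeps the special case limited to OCI.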
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
Ensure we install 4.14 redhat-operators catalog to install LSO when OCP == 4.15: https://github.com/openshift/assisted-service/pull/5323/files
There are a number of initial setup and information-gathering steps required before the ART team can work on building a new image.
The image repository is https://github.com/openshift/kubernetes-metrics-server
Documentation
The ART team are able to start work on adding the image for kubernetes-metrics-server to the automatic build process.
Following points to consider to implement for switching to metrics-server:
Acceptance Criteria:
Based on review https://github.com/openshift/cluster-monitoring-operator/pull/2022#discussion_r1401746992, move this role to cmo's jsonnet so that we can avoid duplication in metrics-server jsonnet
Vendor openshift/api to openshift/cluster-config-operator to bring featuregate changes.
https://github.com/openshift/api/pull/1615 is merged but this change won't be available unless we vendor change in openshift/cluster-config-operator
Add checks for whether the feature gate is enabled.
Update API docs to note the TechPreview feature in MetricsServerConfig
The make run-local command in CMO fails with the following error after the feature gate was introduced in #2151:
81243 reflector.go:325] Listing and watching *v1.FeatureGate from github.com/openshift/client-go/config/informers/externalversions/factory.go:116
I1124 11:53:35.992069 81243 reflector.go:325] Listing and watching *v1.ClusterVersion from github.com/openshift/client-go/config/informers/externalversions/factory.go:116
E1124 11:53:36.336227 81243 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "" in featuregates.config.openshift.io/cluster
E1124 11:53:36.336247 81243 simple_featuregate_reader.go:290] cluster failed with : unable to determine features: missing desired version "" in featuregates.config.openshift.io/cluster
E1124 11:53:36.341778 81243
Proposed title of this feature request
Add scrape time jitter tolerations to prometheus
What is the nature and description of the request?
Change the configuration of the OpenShift Prometheus instances to tolerate scrape time jitters.
Why does the customer need this? (List the business requirements)
Prometheus chunk compression relies on scrape times being accurately aligned to the scrape interval. Due to the nature of delta of delta encoding, a small delay from the configured scrape interval can cause tsdb data to occupy significantly more space.
We have observed a 50% difference in on disk tsdb storage for a replicated HA pair.
The downside is a reduction in sample accuracy and potential impact to derivatives of the time series. Allowing a jitter toleration will trade off improved chunk compression for reduced accuracy of derived data like the running average of a time series.
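The compression effect described above can be illustrated with a small Python sketch. It is not Prometheus code; it only computes the delta-of-delta values of a series of scrape timestamps in milliseconds, which is the quantity the request says chunk compression depends on. The 15-second interval and the jitter values are made-up numbers.

```python
def delta_of_deltas(timestamps_ms: list[int]) -> list[int]:
    """Return the second-order differences of a list of timestamps."""
    deltas = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]


INTERVAL_MS = 15_000  # assumed scrape interval

# Perfectly aligned scrapes: every delta-of-delta is zero.
aligned = [i * INTERVAL_MS for i in range(6)]

# The same scrapes with a few milliseconds of jitter (illustrative values).
jitter_ms = [0, 3, -2, 4, 1, -3]
jittered = [t + j for t, j in zip(aligned, jitter_ms)]

print(delta_of_deltas(aligned))
print(delta_of_deltas(jittered))
```

The aligned series produces only zeros, while the jittered series produces small non-zero values that each need extra bits to store. Tolerating jitter amounts to accepting the aligned timestamps instead of the observed ones, which is the accuracy trade-off noted above.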
List any affected packages or components.
Prometheus
1. Proposed title of this feature request
UserWorkLoad monitoring pods having any problem should not degrade ClusterMonitoringOperator(CMO)
2. What is the nature and description of the request?
Currently, when a customer has enabled User Workload Monitoring (UWM) and any pod in the UWM project goes down, CMO becomes degraded. This should not happen, because core monitoring is still working fine.
3. Why does the customer need this? (List the business requirements here)
If the pods under UWM have a problem, CMO degrades. This alarms the customer, because one of the core cluster operators is reported as degraded, and the customer raises high-severity cases that are not needed, since cluster monitoring is working fine.
4. List any affected packages or components.
Cluster Monitoring Operator
—
*This was the first idea but isn't what was implemented at the end, check the comments below and the document linked to https://issues.redhat.com/browse/MON-3421.*
Implement the changes presented during https://issues.redhat.com/browse/MON-3375 here https://docs.google.com/document/d/1mPpGbrR4Pv1AkBX_qRGLQPLRIx87KPv9-y91Xq1UPMg/edit?usp=sharing
.spec.remoteWrite[].sendExemplars
.spec.enableFeatures
[1]Prometheus [monitoring.coreos.com/v1]
https://docs.openshift.com/container-platform/4.13/rest_api/monitoring_apis/prometheus-monitoring-coreos-com-v1.html
However "PrometheusRestrictedConfig" doesn't support both options.
Enable sending exemplars over remote-write in UWM. This may additionally require setting enableFeatures, and some other defaults (MON-3501).
https://github.com/rhobs/handbook/pull/59/files
https://github.com/openshift/cluster-monitoring-operator/pull/1631
https://github.com/openshift/origin/pull/27031
https://github.com/openshift/cluster-monitoring-operator/pull/1580
https://github.com/openshift/cluster-monitoring-operator/pull/1552
Require read-only access to Alertmanager in developer view.
https://issues.redhat.com/browse/RFE-4125
Common user should not see alerts in UWM.
https://issues.redhat.com/browse/OCPBUGS-17850
Related ServiceAccounts.
Interconnection diagram in monitoring stack.
https://docs.google.com/drawings/d/16TOFOZZLuawXMQkWl3T9uV2cDT6btqcaAwtp51dtS9A/edit?usp=sharing
None.
In CMO, Thanos Querier pods have an Oauth-proxy on port 9091 for web access on all paths.
We are going to replace it with kube-rbac-proxy.
The current behavior is to allow access to the Thanos Querier web server for any user having "get" access to "namespace" resources. We do not have to keep the same logic, but we have to make sure no regression occurs.
We use the subresource "prometheus/api" to authorize both "post" and "get" HTTP requests to kube-rbac-proxy.
We update the cluster role "cluster-monitoring-view" with new access privileges and prepare a new role for API access.
In https://issues.redhat.com/browse/MON-1949 we added a feature to deploy an additional kubelet service monitor that adds cAdvisor metrics with the exposed timestamps. This improves consistency of some container metrics, at the cost of delayed staleness awareness.
With https://github.com/prometheus/prometheus/pull/13060/ we can finally ask Prometheus to ingest explicit timestamps but also use staleness markers for low-latency stale detection, i.e. we can deprecate and remove the dedicated ServiceMonitor feature when this Prometheus feature ships in OCP.
By setting honorTimestamps: true and trackTimestampsStaleness: true, we can now get the correct cAdvisor timestamps with the default staleness handling.
This should be done for both cAdvisor SMs in assets/control-plane/service-monitor-kubelet.yaml and assets/control-plane/minimal-service-monitor-kubelet.yaml.
1. Proposed title of this feature request
Support NodeClockNotSynchronising when NTP service is disabled but OCP cluster is using PTP-Operator for time sync
2. What is the nature and description of the request?
Currently, when an SNO OCP Telco cluster is installed and uses the time sync reference from ptp-operator, the Alertmanager service continuously fires an alert named `NodeClockNotSynchronising`, because the alerting rule only checks whether NTP sync exists on the system, as described here.
3. Why does the customer need this? (List the business requirements here)
Since we provide two time synchronization systems, the alert should cover the use of both so that the system does not fire false-positive alerts. In the situation described above, the systems are time-synced using ptp-operator, but the system still sends alerts about not being synchronised.
4. List any affected packages or components.
When the PTP operator is installed, it brings its own alerting rule to detect clock drift which is more reliable than the out-of-the-box NodeClockNotSynchronising and NodeClockSkewDetected alerts:
The NodeClockNotSynchronising PromQL expression should be adjusted to "mute" itself when the PTP operator is installed.
expr: |
  (
    min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
    and
    node_timex_maxerror_seconds{job="node-exporter"} >= 16
  )
  # addition to the upstream expression
  and on() absent(up{job="ptp-monitor-service"})
If this Epic is an RFE, please complete the following questions to the best of your ability:
Q1: Proposed title of this RFE
Q2: What is the nature and description of the RFE?
openshift-install currently takes an install-config.yaml parameter for `hyperthreading`. On Power it is not a boolean. Threading can be set to 1/2/4/8.
Q3: Why does the customer need this? (List the business requirements here)
Some workloads do not scale well with the default level of threads on Power (8). If the workload works better with SMT=4, the customer would prefer to set the parameter during install.
Q4: List any affected packages or components
openshift-install
hyperthreading parameter in install-config.yaml
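A non-boolean threading setting would need its own validation. The following Python sketch is only an illustration of that idea under the values named in this request (1/2/4/8); the `validate_smt_level` helper is hypothetical and does not reflect how openshift-install validates `hyperthreading` today.

```python
VALID_SMT_LEVELS = (1, 2, 4, 8)  # thread counts named in the request for Power


def validate_smt_level(level: int) -> int:
    """Return the level unchanged if it is a supported SMT value, else raise ValueError."""
    if level not in VALID_SMT_LEVELS:
        raise ValueError(
            f"unsupported SMT level {level}; expected one of {VALID_SMT_LEVELS}"
        )
    return level


print(validate_smt_level(4))
```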
Research development required for support of setting SMT levels on Power hardware in Openshift
Epic Goal
Running doc to describe terminologies and concepts which are specific to Power VS - https://docs.google.com/document/d/1Kgezv21VsixDyYcbfvxZxKNwszRK6GYKBiTTpEUubqw/edit?usp=sharing
https://github.com/IBM-Cloud/power-go-client/blob/master/clients/instance/ibm-pi-datacenters.go and https://github.com/IBM-Cloud/power-go-client/blob/master/clients/instance/ibm-pi-workspaces.go were recently added to power-go-client
1. Upgrade power-go-client in the installer to a version where these are available
2. Add check that uses these to query if PER is available. Fail if not.
mad02, mad04, and wdc06 will all GA with PER by end of the year.
Currently all regions have available sysTypes of s922 or e980
However with mad02, a sysType of s1022 will be the only option.
Because of the short window of 4.15 I'm suggesting that we hardcode the values for available system types into powervs_regions.go however I am open to a solution that queries the region for sysType if there is an easy way to do it. Depending on the complexity that may have to wait for 4.16.
For now, I think we should:
1) Add sysType array to powervs_regions.go
2) Update the sysType validation to ensure that sysType is in the array (rather than checking the static map we have today)
3) Ensure that this value will be passed into terraform. I believe it is today, but please confirm.
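Steps 1 and 2 above can be sketched as follows. This is a Python illustration of the idea rather than the installer's Go code: the `REGION_SYS_TYPES` table, the `validate_sys_type` helper, and the `example-region` name are hypothetical, and only the sysType values come from the description above.

```python
# Hypothetical per-region table; the real data would live in powervs_regions.go.
REGION_SYS_TYPES = {
    "example-region": ["s922", "e980"],  # placeholder name for an existing region
    "mad02": ["s1022"],  # s1022 is described above as the only option here
}


def validate_sys_type(region: str, sys_type: str) -> None:
    """Raise ValueError unless sys_type is listed for the given region."""
    allowed = REGION_SYS_TYPES.get(region)
    if allowed is None:
        raise ValueError(f"unknown region {region!r}")
    if sys_type not in allowed:
        raise ValueError(
            f"sysType {sys_type!r} is not available in {region!r}; choose from {allowed}"
        )


validate_sys_type("mad02", "s1022")  # passes silently
```

Keeping the allowed values next to each region means a new region with a different sysType only needs a new table entry, not a change to the validation logic.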
In 4.15 we don't want to support cloud connection zones. This will initially reduce the number of zones we can use, but we will add zones as they come online.
Here is our overall tech debt backlog: ODC-6711
See included tickets, we want to clean up in 4.15.
As a developer, I want to replace the current topology graphic with the new designs created by the ux team. The info for these graphics located here:
Source files for the new OpenShift artwork (the "O").
Guidelines on usage can be found here.
Protractor reached end of life in September 2023. As a result, we need to remove Protractor from the console. In order to do so, we either need to remove any existing Protractor tests or migrate them to Cypress.
As a user, I want to use the latest version of Cypress.
Cypress has been upgraded in PR https://github.com/openshift/console/pull/13070, and the e2e tests have been disabled for all the packages owned by ODC because they would break the CI.
Improve CI coverage for E2E tests and stabilize tests for better product health.
Improving the health of CI and devconsole which will impact PR review effectiveness.
Description of problem:
Run topology package against CI and fix any discrepancy with tests
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
CU cluster of the Mavenir deployment has cluster-node-tuning-operator in a CrashLoopBackOff state and does not apply performance profile
Version-Release number of selected component (if applicable):
4.14rc0 and 4.14rc1
How reproducible:
100%
Steps to Reproduce:
1. Deploy the CU cluster with the ZTP GitOps method
2. Wait for the Policies to be compliant
3. Check the worker nodes and the cluster-node-tuning-operator status
Actual results:
Nodes do not have the performance profile applied. cluster-node-tuning-operator is crashing with the following in its logs:
E0920 12:16:57.820680 1 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(nil), concrete:(*runtime._type)(nil), asserted:(*runtime._type)(0x1e68ec0), missingMethod:""} (interface conversion: interface is nil, not v1.Object)
goroutine 615 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c98c20?, 0xc0006b7a70})
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc000d49500?})
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1c98c20, 0xc0006b7a70})
    /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cluster-node-tuning-operator/pkg/util.ObjectInfo({0x0?, 0x0})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/util/objectinfo.go:10 +0x39
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).machineConfigLabelsMatch(0xc000a23ca0?, 0xc000445620, {0xc0001b38e0, 0x1, 0xc0010bd480?})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:374 +0xc7
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*ProfileCalculator).calculateProfile(0xc000607290, {0xc000a40900, 0x33})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/profilecalculator.go:208 +0x2b9
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).syncProfile(0xc000195b00, 0x0?, {0xc000a40900, 0x33})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:664 +0x6fd
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).sync(0xc000195b00, {{0x1f48661, 0x7}, {0xc000000fc0, 0x26}, {0xc000a40900, 0x33}, {0x0, 0x0}})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:371 +0x1571
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor.func1(0xc000195b00, {0x1dd49c0?, 0xc000d49500?})
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:193 +0x1de
github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).eventProcessor(0xc000195b00)
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:212 +0x65
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:226 +0x3e
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0x0?, {0x224ee20, 0xc000c48ab0}, 0x1, 0xc00087ade0)
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:227 +0xb6
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0x0?, 0x3b9aca00, 0x0, 0x0?, 0xc0004e6710?)
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:204 +0x89
k8s.io/apimachinery/pkg/util/wait.Until(0xc0004e67d0?, 0x91af86?, 0xc000ace0c0?)
    /go/src/github.com/openshift/cluster-node-tuning-operator/vendor/k8s.io/apimachinery/pkg/util/wait/backoff.go:161 +0x25
created by github.com/openshift/cluster-node-tuning-operator/pkg/operator.(*Controller).run
    /go/src/github.com/openshift/cluster-node-tuning-operator/pkg/operator/controller.go:1407 +0x1ba5
panic: interface conversion: interface is nil, not v1.Object [recovered]
    panic: interface conversion: interface is nil, not v1.Object
Expected results:
cluster-node-tuning-operator is functional, performance profiles applied to worker nodes
Additional info:
There is no issue on a DU node of the same deployment coming from the same repository: the DU node is configured as requested and cluster-node-tuning-operator is functioning correctly.
must-gather from rc0: https://drive.google.com/file/d/1DlzrjQiKTVnQKXdcRIijBkEKjAGsOFn1/view?usp=sharing
must-gather from rc1: https://drive.google.com/file/d/1qSqQtIunQe5e1hDVDYwa90L9MpEjEA4j/view?usp=sharing
performance profile: https://gitlab.cee.redhat.com/agurenko/mavenir-ztp/-/blob/airtel-4.14/policygentemplates/group-cu-mno-ranGen.yaml
Moad Zardab to fill out with something useful.
What
We appear to be missing some expected metrics for the telemeter-staging and telemeter-production namespaces.
This task is about identifying the list of missing metrics that we need in order to detect this issue in the future:
When quorum breaks and we are able to get a snapshot of one of the etcd members, we need a procedure to restore the etcd cluster for a given HostedCluster.
Documented here: https://docs.google.com/document/d/1sDngZF-DftU8_oHKR70E7EhU_BfyoBBs2vA5WpLV-Cs/edit?usp=sharing
Add the above documentation to the HyperShift repo documentation.
We need to bump the Kubernetes version and run a library sync for OCP 4.13. Two stories will be created, one for each activity.
We need to bump the Kubernetes version to the latest API version OCP is using.
This is what was done last time:
https://github.com/openshift/cluster-samples-operator/pull/409
Find latest stable version from here: https://github.com/kubernetes/api
This is described in wiki: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
As a Sample Operator developer, I would like to run the library sync process so that the new libraries can be pushed to OCP 4.15.
This is a runbook we need to execute on every release of OpenShift
NA
NA
NA
Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Library Repo
Library sync PR is merged in master
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
We need to bump the version of Kubernetes and run a library sync for OCP 4.13. Two stories will be created, one for each activity.
As a Sample Operator Developer, I would like to run the library sync process, so the new libraries can be pushed to OCP 4.16.
This is a runbook we need to execute on every release of OpenShift
NA
NA
NA
Follow instructions here: https://source.redhat.com/groups/public/appservices/wiki/cluster_samples_operator_release_activities
Library Repo
Library sync PR is merged in master
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Unknown
Verified
Unsatisfied
Epic Goal*
Improve automation of the OCP platform. E.g. implement new oc [adm] commands improving the image release or release nodes process. Addressing technical debt.
Why is this important? (mandatory)
Improving our efficiency and productivity
Scenarios (mandatory)
Variable
Dependencies (internal and external) (mandatory)
Variable
Contributing Teams(and contacts) (mandatory)
Technical
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
When the release controller shows the changelog between two releases, it ends up assuming a mentioned PR belongs to the fork, and makes a link
to it, despite the PR actually belonging to upstream. It gets this data from `oc`, by running something like:
```
oc adm release info --changelog=/tmp/git/ registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-26-111919 registry.ci.openshift.org/ocp/release:4.12.0-0.nightly-2022-09-28-204419
```
If you look at that payload, you'll see the upstream kubernetes/kubernetes PRs are assumed to have belonged to openshift/kubernetes instead. This creates incorrect data, and sometimes links to the wrong PR when a PR with the same number exists in both repos.
Is there a way to only show PRs from the release payload repo?
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were not completed when this image was assembled
Description
This epic tracks install server errors which require investigation into whether the RP or ARO-Installer can be more resilient or return a User Error instead of a Server Error.
This lives under shift improvement as it will reduce the number of incidents we get from customers due to `Internal Server Error` being returned.
How to Create Stories Under Epic
During the weekly SLO/SLA/SLI meeting, we will examine install failures on the cluster installation failure dashboard. We will aggregate the top occurring items which show up as Server Errors, and each story underneath will be an investigation required to figure out root cause and how we can either prevent it, be resilient to failures, or return a User Error.
AS AN ARO SRE
I WANT either
SO THAT
1. Decorate the error in the ARO installer to return a more informative message as to why the SKU was not found.
2. Ensure that the OpenShift installer and the RP are validating the SKU in the same manner
3. If the validation is the same between the Installer and ARO installer, we have the option to remove the ARO installer validation step
Acceptance Criteria
Given: The RP and the ARO Installer validate the SKU in the same manner
When: The RP validates
Then: The ARO Installer does not
Given: The ARO Installer performs additional or improved validation compared to the RP
When: The ARO Installer validation fails due to missing SKU (failed validation)
Then: Enhance the log to include the SKU that was not found, providing us with more information to troubleshoot
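The second acceptance criterion can be illustrated with a minimal sketch, assuming a simple set-membership check. `SkuNotFoundError` and `validate_sku` are hypothetical names introduced only for illustration; they are not taken from the RP or the ARO Installer.

```python
from typing import Set


class SkuNotFoundError(Exception):
    """Raised when a requested VM SKU is not offered in the target location."""


def validate_sku(requested_sku: str, available_skus: Set[str], location: str) -> None:
    # Hypothetical helper: the real RP and ARO Installer validation is not shown in the source.
    if requested_sku not in available_skus:
        # Name the missing SKU and the location so the log is actionable.
        raise SkuNotFoundError(
            f"SKU {requested_sku!r} was not found in location {location!r} "
            f"({len(available_skus)} SKUs were checked)"
        )
```

Including the SKU name in the error text is what lets an SRE tell a user error (an unsupported SKU was requested) apart from a genuine server-side failure.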
Breadcrumbs
Epic Goal*
Transition the automated backup feature added in ETCD-81 to GA.
Why is this important? (mandatory)
Providing a way to enable automated backups would improve the disaster recovery outcomes by increasing the likelihood that admins have a recent etcd backup saved for their cluster.
The feature has been in Tech Preview since 4.14; we should make it available to everyone by default.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams(and contacts) (mandatory)
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
We have a currently unused golang utility to take backup snapshots in `cluster-etcd-operator cluster-backup`. After a cursory look, it works.
We should prefer the golang utility for better testability for our backup feature instead.
This story covers bumps of k8s and openshift packages on both operator and operand.
After Layering and Hypershift GAs in 4.12, we need to remove the code and builds that are no longer associated with mainline OpenShift.
This describes non-customer-facing work.
Remove code from the MCO where we may be referencing machine-os-content, such as the legacy OS update path.
This is intended to be a place to capture general "tech debt" items so they don't get lost. I very much doubt that this will ever get completed as a feature, but that's okay, the desire is more that stories get pulled out of here and put with feature work "opportunistically" when it makes sense.
If you find a "tech debt" item, and it doesn't have an obvious home with something else (e.g. with MCO-1 if it's metrics and alerting) then put it here, and we can start splitting these out/marrying them up with other epics when it makes sense.
During my work on Project Labrador, I learned that there are advanced caching directives that one can add to their Containerfiles. These do things such as allowing the package manager cache to be kept out of the image build, but to remain after the build so that subsequent builds don't have to download the packages. Golang has a great incremental build story as well, provided that one leaves the caches intact.
To begin with, my Red Hat-issued ThinkPad P16v takes approximately 2 minutes and 42 seconds to perform an MCO image build (assuming the builder and base images are already prefetched).
A preliminary test shows that by using advanced caching directives, incremental builds can be reduced to as little as 45 seconds. Additionally, by moving the nmstate binary installation into a separate build stage and limiting what files are copied into that stage, we can achieve a cache hit under most conditions. This cache hit has the additional advantage of that stage not requiring VPN in order to reach the appropriate RPM repository.
Done When:
As an OpenShift developer, I want to know that my code is as secure as possible by running static analysis on each PR.
Periodically, scans are performed on all OpenShift repositories and the container images produced by those repositories. These scans usually result in numerous OCP bugs being opened into our queue (see linked bugs as an example), putting us in a more reactive state. Instead, we can perform these scans on each PR by following these instructions https://docs.ci.openshift.org/docs/how-tos/add-security-scanning/ to add this to our OpenShift CI configurations.
Done When:
This came about on a PR about the memory limit bump we did for the bootstrap controller pod. As per OpenShift guidelines, we should only be setting requests and not limits.
Done when:
This is not a user-visible change.
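A minimal sketch of the "requests only, no limits" guideline, assuming the pod spec is available as a plain dictionary shaped like a Kubernetes PodSpec. `drop_limits` is a hypothetical helper for illustration, not code from the MCO.

```python
import copy


def drop_limits(pod_spec: dict) -> dict:
    """Return a copy of pod_spec whose containers keep resource requests but carry no limits."""
    cleaned = copy.deepcopy(pod_spec)  # leave the caller's object untouched
    for key in ("containers", "initContainers"):
        for container in cleaned.get(key, []):
            resources = container.get("resources") or {}
            resources.pop("limits", None)  # requests, if any, are left as they are
    return cleaned
```

The same check could be turned into a unit test that fails whenever a manifest in the repository declares `limits`.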
METAL-119 provides the upstream Ironic functionality.
To avoid mistakes such as using hashes from the wrong releases, we need a way to test them.
Ideally, this should also be automated in a CI job.
In CI and local builds, if REMOTE_SOURCES and REMOTE_SOURCES_DIR are not defined, they assume the value of `.`, which effectively enables the `COPY . .` command in the Dockerfile.
To avoid potential issues, we need to find an alternative and default REMOTE_SOURCES to something safer.
With the cachito configuration in place, we can now start converting the ironic packages and their dependencies to install from source.
As was done for the ironic-image, we'd like to migrate the ironic-agent-image to a hybrid model, using RPMs for dependencies and source code for the ironic-python-agent installation.
Since we're moving toward a source-based model for the OCP ironic-image using downstream packages, we're seeing more and more discrepancies with the OKD version, which is based on CS9 and upstream packages. These cause conflicts and issues due to missing or outdated dependencies.
For this reason, we'd like to split the lists of installed packages between OCP and OKD, as was done for the ironic-agent-image.
OCP/Telco Definition of Done
Epic Template descriptions and documentation.
<--- Cut-n-Paste the entire contents of this description into your new Epic --->
The new short URLs, once supported by the image service, need to be generated by assisted-service and served as part of its API.
We would like to check the possibility of avoiding a reboot after first boot.
Make changes to avoid reboot by MCO after first boot during installation.
Either provide a solution that avoids reboot, or find why it cannot be done.
Make all the necessary changes for implementation and check all edge cases.
No
Feature origin (who asked for this feature?)
Pass a flag through the install command to indicate whether the logic for avoiding the MCO reboot should be applied by the assisted installer.
If skip MCO reboot is enabled (passed as an argument), the skip MCO reboot logic is applied. This logic consists of:
The epic MGMT-9915 added support for dual-stack VIPs, while keeping backward compatibility with the singular API and Ingress VIP fields, as specified in enhancements/dual-stack-vips.md.
With MGMT-12678, which is a part of the above-mentioned epic, API and Ingress VIP got marked as deprecated, and this Epic is about removing them from the API completely while keeping the plural VIPs in place.
Both api_vip and ingress_vip are no longer a part of the API or the DB.
Yes.
In MGMT-12678, api_vip and ingress_vip got deprecated.
With the api_vips and ingress_vips already merged and operational, the singular VIP fields should be removed.
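For API clients that still send the singular fields, the migration can be sketched as follows. This is a minimal illustration: the field names come from the text above, but the payload shape (a flat dictionary with the plural fields as lists of address strings) is an assumption, and `to_plural_vips` is a hypothetical helper.

```python
def to_plural_vips(cluster: dict) -> dict:
    """Rewrite a legacy payload so it only uses api_vips / ingress_vips."""
    migrated = dict(cluster)
    for singular, plural in (("api_vip", "api_vips"), ("ingress_vip", "ingress_vips")):
        legacy = migrated.pop(singular, None)  # the singular key never survives
        if legacy and not migrated.get(plural):
            # Assumed shape: the plural field is a list of address strings.
            migrated[plural] = [legacy]
    return migrated
```

When both the singular and plural fields are present, the plural value is kept and the singular one is dropped.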
When the infrastructure operator is enabled, automatically import the cluster and enable users to add nodes to the cluster itself via the infrastructure operator.
Yes, it's a new functionality that will need to be documented
In MGMT-15704 we added a local-cluster-import feature to add the local-cluster to the infrastructure operator.
We now require an "end to end" test that will make sure that a node may be added to the cluster without issues.
Please describe what this feature is going to do.
Please describe what conditions must be met in order to mark this feature as "done".
If the answer is "yes", please make sure to check the corresponding option.
When the e2e-ai-operator-ztp-capi job fails, we don't collect the relevant logs for debugging the failure.
We should collect:
We used to get all these logs in the past but now we are getting this instead:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-assisted-service-master-edge-e2e-ai-operator-ztp-capi-periodic/1657542530343899136/artifacts/e2e-ai-operator-ztp-capi-periodic/assisted-baremetal-operator-gather/artifacts/hypershift/
I created a SNO cluster through the SaaS. A minor issue prevented one of the ClusterOperators from reporting "Available = true", and after one hour, the install process hit a timeout and was marked as failed. When I found the failed install, I was able to easily resolve the issue and get both the ClusterOperator and ClusterVersion to report "Available = true", but the SaaS no longer cared; as far as it was concerned, the installation failed and was permanently marked as such. It also would not give me the kubeadmin password, which is an important feature of the install experience and tricky to obtain otherwise. A hard timeout can cause more harm than good, especially when applied to a system (openshift) that continuously tries to get to desired state without an absolute concept of an operation succeeding or failing; desired state just hasn't been achieved yet. We should consider softening this timeout to be a warning that installation hasn't achieved completion as quickly as expected, without actively preventing a successful outcome.
Late binding scenario (kube-api):
A user tries to install a cluster with the late binding feature enabled (deleting the cluster returns the hosts to the InfraEnv). The installation times out and the cluster goes into an error state. The user then connects to the cluster and fixes the issue.
AI will still consider the cluster to be in error. If the user tries to perform day-2 operations on a cluster in the error state, they will fail. The only option is to delete the cluster and create another one that is marked as installed, but that causes the hosts to boot from the discovery ISO.
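The "soft timeout" idea described above can be sketched as a status function in which exceeding the timeout only produces a warning state and never a terminal failure. The state names and `evaluate_install` are hypothetical and do not reflect the assisted-service state machine.

```python
from enum import Enum


class InstallStatus(Enum):
    INSTALLING = "installing"
    INSTALLING_DELAYED = "installing-delayed"  # hypothetical warning state, not a failure
    INSTALLED = "installed"


def evaluate_install(elapsed_seconds: float, soft_timeout_seconds: float,
                     cluster_available: bool) -> InstallStatus:
    """Re-evaluate the install status; the timeout only downgrades to a warning."""
    if cluster_available:
        # Success is always reachable, even after the soft timeout has passed.
        return InstallStatus.INSTALLED
    if elapsed_seconds > soft_timeout_seconds:
        return InstallStatus.INSTALLING_DELAYED
    return InstallStatus.INSTALLING
```

Because the function is re-evaluated from current cluster state, a cluster that is repaired after the one-hour mark would still be reported as installed.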
Epic Goal
Why is this important?
Scenarios
1. …
Acceptance Criteria
Dependencies (internal and external)
1. …
Previous Work (Optional):
1. …
Open questions:
1. …
Done Checklist
Currently, the create infra command creates various serviceIDs to be used by various cluster operators. Of these, the storage operator's serviceID is exposed in the guest cluster, which should not happen. We need to reduce the scope of all the serviceIDs to only the infra resources created for that cluster.
With respect to this comment https://github.com/openshift/cluster-api-provider-ibmcloud/blob/main/main.go#L168C2-L175 rename --powervs-provider-id-fmt param to --provider-id-fmt in capi deployment.
Epic Goal
Why is this important?
Additional Context
Acceptance Criteria
Epic Goal
`sao04`, `wdc07`, `eu-de-1`, and `eu-de-2` have all added PER capability. Add them to the installer.
Filing a ticket based on this conversation here: https://github.com/openshift/enhancements/pull/1014#discussion_r798674314
Basically the tl;dr here is that we need a way to ensure that machinesets are properly advertising the architecture that the nodes will eventually have. This is needed so the autoscaler can predict the correct pool to scale up/down. This could be accomplished through user driven means like adding node arch labels to machinesets and if we have to do this automatically, we need to do some more research and figure out a way.
For autoscaling nodes in a multi-arch compute cluster, node architecture needs to be taken into account because such a cluster could potentially have nodes of up to 4 different architectures. Labels can be propagated today from the machineset to the node group, but they have to be injected manually.
This story explores whether the autoscaler can use cloud provider APIs to derive the architecture of an instance type and set the label accordingly rather than it needing to be a manual step.
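A minimal sketch of that idea, assuming the cloud provider lookup is injected as a callable. `kubernetes.io/arch` is the well-known Kubernetes node label; `arch_label_for` and `lookup_arch` are hypothetical names, and the stub used in the usage note stands in for a real cloud API call.

```python
from typing import Callable, Dict, Optional

ARCH_LABEL = "kubernetes.io/arch"

# Cloud providers and Kubernetes name architectures differently.
_CLOUD_TO_KUBE_ARCH = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
    "ppc64le": "ppc64le",
    "s390x": "s390x",
}


def arch_label_for(instance_type: str,
                   lookup_arch: Callable[[str], Optional[str]],
                   existing_labels: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return node-group labels with the architecture label filled in."""
    labels = dict(existing_labels or {})
    if ARCH_LABEL in labels:
        return labels  # a manually injected label takes precedence
    kube_arch = _CLOUD_TO_KUBE_ARCH.get(lookup_arch(instance_type) or "")
    if kube_arch is None:
        raise ValueError(f"cannot determine architecture for {instance_type!r}")
    labels[ARCH_LABEL] = kube_arch
    return labels
```

With a stub such as `{"m6g.large": "arm64"}.get` as `lookup_arch`, the function returns `{"kubernetes.io/arch": "arm64"}` for `m6g.large`, which is the label the autoscaler would need on the node group.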
The user experience of the DPU network operator should be improved. Today, there are a lot of steps that need to be followed precisely. Many of those steps can be handled by the DPU network operator.
The DPU can have 2 different encap IPs (one from the x86 host and one from the ARM DPU). There should always be one encap IP in the OVN-K virtual network topology.
As a developer I want consistent build versions so that when the release team is organizing master images they don't encounter errors due to missing git tags.
See OCPCLOUD-2173 for background.
See https://github.com/openshift/cluster-version-operator/blob/master/hack/build-go.sh for an example implementation.
Setting up the distgit (for production brew builds) depends on the ART team. This should be tackled early.
The PR to ocp-build-data should also be prioritised, as it blocks the PR to openshift/os. There is a separate CI Mirror used to run CI for openshift/os in order to merge, which can take a day to sync.
As a user, I want kubelet to know how to authenticate with ACR automatically so that I don't have to roll credentials every 12h.
This functionality is being removed from the in-tree kubelet code, so we now need to provide it via a credential provider plugin.
Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.
See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR
As a user, I want kubelet to know how to authenticate with GCR automatically so that I don't have to roll credentials every 12h.
This functionality is being removed from the in-tree kubelet code, so we now need to provide it via a credential provider plugin.
Before this can be completed, we will need to create and ship an rpm within RHCOS to provide the binary kubelet will exec.
See https://github.com/openshift/machine-config-operator/pull/4103/files for an example PR
The kubelet no longer needs the cloud-config flag as it is no longer running in-tree code.
It is currently handled in the templates by this function which will need to be removed, along with any instances in the templates that call the function.
This should cause the flag to be omitted from future kubelet configuration.
Code in library-go currently uses feature gates to determine if Azure and GCP clusters should be external or not. They have been promoted for at least one release and we do not see ourselves going back.
In 4.17 the code is expected to be deleted completely.
We should remove the reliance on the feature gate from this part of the code and clean up references to feature gate access at the call sites.
When a new OCP release branch is cut, there are a number of things that need to be updated manually to point to the new release.
Update Red Hat and Certified Operators index image tags:
These catalogs need to be created first before we do this work.
Update community operators index image tags:
These catalogs need to be created first before we do this work.
This is a response to https://issues.redhat.com/browse/OCPBUGS-12893 which is more of a feature request than a bug. The ask is that we put the ingress VIP in a fault state when there is no ingress controller present on the node so it won't take the VIP even if no other node takes it either.
I don't believe this is a bug because it's related to an unsupported configuration, but I still think it's worth doing because it will simplify our remote worker process. If we put the VIP in a fault state it won't be necessary to disable keepalived on remote workers, the ingress service just needs to be placed correctly and keepalived will do the right thing.
We will use this to address tech debt in OLM in the 4.10 timeframe.
Items to prioritize are:
CI e2e flakes
The operator framework portfolio consists of a number of projects that
have an upstream project. The verify-commits Makefile target had been
introduced in each of the downstream projects to assist with the
downstreaming efforts. The commit checking logic was eventually moved
into a container and was rolled out as the verify-commits-shared test,
which ran alongside the verify-commits test to ensure that it worked
as expected.
We are now confident that the verify-commits-shared test is running as
expected, and can replace the logic used in the verify-commits test with
that of the verify-commits-shared test, which will allow us to remove
the verify-commits Makefile targets from each of the repos, simplifying
the code base and reducing duplication of code.
What
RHEL machines need to be able to ship CPU info metrics into MST to support metering
Why
To support billing customers as they convert from CentOS 7 to RHEL
Useful links
RHEL Observability - PoC (Google doc)
RHOBS Rhelemeter PoC (google doc)
What
As per this Slack thread, the RHEL team is trying to route requests to the "rhelemeter" instance through a vendor (Akamai). There are some issues with the existing mTLS setup because it looks like the service terminates SSL and does not propagate the client certificate in the current setup.
Ideally we transmit traffic through this service in order to maintain the allow list on RHEL machines that have their firewalls open to console.redhat.* domains.
This task is to investigate whether we can enable rhelemeter to read the certificate details from a different header that is encrypted using a shared key.
We should also consider the possibility of enabling an IP address range on the route itself.
Figure out what features Akamai has: can it support this out of the box and, if it can, why can't we enable that before deciding on making the code change?
Acceptance Criteria
Remark that:
Slack thread:
https://redhat-internal.slack.com/archives/CCX9DB894/p1696515395395939
Acceptance criteria
Epic is used to accumulate all Tech Debt Issues, and later move them into different Release-aimed Tech Debt Epics
The CNO go.mod is currently very tricky to work with due to HyperShift deps. We can remove these deps and instead reference a dynamic API object.
Issues in the upstream ovnk repo are not reflected anywhere in Jira, and it's time-consuming to do that manually for each issue we have. Let's automate that with a script.
Description of problem:
Installing OCP fails on the Local Zones (LZ) and Wavelength Zones (WLZ) listed below. What these regions have in common is that each has only one type of zone, LZ or WLZ. For example, af-south-1 has only LZ and no WLZ, while ap-northeast-2 has only WLZ and no LZ.
Failed regions/zones:
af-south-1 ['af-south-1-los-1a']
failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in af-south-1
ap-south-1 ['ap-south-1-ccu-1a', 'ap-south-1-del-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-south-1
ap-southeast-1 ['ap-southeast-1-bkk-1a', 'ap-southeast-1-mnl-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-1
me-south-1 ['me-south-1-mct-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in me-south-1
ap-southeast-2 ['ap-southeast-2-akl-1a', 'ap-southeast-2-per-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in ap-southeast-2
eu-north-1 ['eu-north-1-cph-1a', 'eu-north-1-hel-1a']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Wavelength Zone names: no zones with type wavelength-zone in eu-north-1
ap-northeast-2 ['ap-northeast-2-wl1-cjj-wlz-1', 'ap-northeast-2-wl1-sel-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ap-northeast-2
ca-central-1 ['ca-central-1-wl1-yto-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in ca-central-1
eu-west-2 ['eu-west-2-wl1-lon-wlz-1', 'eu-west-2-wl1-man-wlz-1', 'eu-west-2-wl2-man-wlz-1']
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[1].platform.aws: Internal error: getting Local Zones: unable to retrieve Local Zone names: no zones with type local-zone in eu-west-2
Version-Release number of selected component (if applicable):
4.15.0-rc.3-x86_64
How reproducible:
Steps to Reproduce:
1) install OCP on above regions/zones
Actual results:
See description.
Expected results:
Don't check LZ availability while installing OCP in a WLZ.
Don't check WLZ availability while installing OCP in an LZ.
Additional info:
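The expected behavior can be sketched as a validation that only looks up the zones the user actually requested, so a region with no zones of the other type is not an error. This is an illustration in Python, not the installer's Go code; `validate_edge_zones` and its input shape are assumptions.

```python
from typing import Dict, Iterable, Set


def validate_edge_zones(requested: Iterable[str],
                        available_by_type: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each requested edge zone to its type, failing only on unknown zones.

    available_by_type maps a zone type ("local-zone", "wavelength-zone") to the
    zone names the region offers; an empty set for a type is allowed.
    """
    zone_type = {
        zone: ztype
        for ztype, zones in available_by_type.items()
        for zone in zones
    }
    missing = [zone for zone in requested if zone not in zone_type]
    if missing:
        raise ValueError(f"zones not available in region: {missing}")
    return {zone: zone_type[zone] for zone in requested}
```

For af-south-1, which offers only Local Zones, passing an empty `"wavelength-zone"` set succeeds as long as every requested zone is a known Local Zone.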
The installer is using a very old reference to the machine-config project for APIs.
Its APIs have been moved to openshift/api. Update the imports and unit tests.
This will make an import error go away and make it easier to update in the future.
We have identified gaps in our attempted test coverage that monitors for acceptable Alerts firing during cluster upgrades that need to be addressed to make sure we are not allowing regressions into the product.
This epic is to group that work.
This is a goal to help Service Delivery with unmanageable alerts during upgrades. Reach out to SD and get an initial list of alerts that are causing problems, then see how they are appearing in our data.
If they do appear in CI, work out a testing framework for specific alerts that we can make flake/fail. This might be a threshold, but I suspect more likely will be things that shouldn't fire at all.
Add test, make it a flake, file the bug for relevant team, and graduate the test to a failure once the issue is resolved.
Christoph Blecker could you get us a list of 3-5 of your top problematic alerts?
Epic Goal*
What is our purpose in implementing this? What new capability will be available to customers?
Placeholder for any technical debt for the team. We are required to have an epic for each story, and technical debt can vary from minor code changes to complex refactoring, so creating an epic card for each such action creates a lot of paperwork, which complicates any such activity (for example, identifying scope and providing a corresponding description in connection to existing features and other cards). We need a very simple way to organize activities that fall under any improvements that make the overall maintenance and design better.
Why is this important? (mandatory)
What are the benefits to the customer or Red Hat? Does it improve security, performance, supportability, etc? Why is work a priority?
To help the team reduce the paperwork.
Scenarios (mandatory)
Dependencies (internal and external) (mandatory)
Varies
Contributing Teams(and contacts) (mandatory)
Done - Checklist (mandatory)
Follow-up to https://issues.redhat.com/browse/WRKLDS-487 and https://issues.redhat.com/browse/WRKLDS-592
Remove the duplicate passing of kubeconfig into route-controller-manager in HyperShift.
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/101
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When we create a MC that declares the same kernel argument twice, MCO is adding it only once.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-0.nightly-2023-09-22-181920   True        False         5h18m   Cluster version is 4.12.0-0.nightly-2023-09-22-181920
We have seen this behavior in 4.15 too: 4.15.0-0.nightly-2023-09-22-224720
How reproducible:
Always
Steps to Reproduce:
1. Create a MC that declares 2 kernel arguments with the same value (z=4 is duplicated):
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-kernel-arguments-32-zparam
spec:
  config:
    ignition:
      version: 3.2.0
  kernelArguments:
    - y=0
    - z=4
    - y=1
    - z=4
Actual results:
We get the following parameters:
$ oc debug -q node/sergio-v12-9vwrc-worker-c-tpbvh.c.openshift-qe.internal -- chroot /host cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-a594b3a14778ce39f2b42ddb90e933c1971268a746ef1678a3c6eedee5a21b00/vmlinuz-4.18.0-372.73.1.el8_6.x86_64 ostree=/ostree/boot.0/rhcos/a594b3a14778ce39f2b42ddb90e933c1971268a746ef1678a3c6eedee5a21b00/0 ignition.platform.id=gcp console=ttyS0,115200n8 root=UUID=e101e976-e029-411d-ad71-6856f3838c4f rw rootflags=prjquota boot=UUID=75598fe5-c10d-4e95-9747-1708d9fe6a10 console=tty0 y=0 z=4 y=1
There is only one "z=4" parameter. We should see "y=0 z=4 y=1 z=4" instead of "y=0 z=4 y=1".
Expected results:
In older versions, the duplicated parameters are kept. For example, this is the output on an IPI on AWS 4.9 cluster:
$ oc debug -q node/ip-10-0-189-69.us-east-2.compute.internal -- chroot /host cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-e1eeff6ec1b9b70a3554779947906f4a7fb93e0d79fbefcb045da550b7d9227f/vmlinuz-4.18.0-305.97.1.el8_4.x86_64 random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.1/rhcos/e1eeff6ec1b9b70a3554779947906f4a7fb93e0d79fbefcb045da550b7d9227f/0 ignition.platform.id=aws root=UUID=ed307195-b5a9-4160-8a7a-df42aa734c28 rw rootflags=prjquota y=0 z=4 y=1 z=4
All the parameters are created, including the duplicated "z=4".
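The comparison between the declared kernel arguments and /proc/cmdline can be sketched as a small multiset check. This is only an illustration of the verification step, not MCO code; the function name missing_kernel_args is hypothetical and the command lines are shortened to the relevant tail.

```python
from collections import Counter


def missing_kernel_args(declared, cmdline):
    """Return declared arguments that occur fewer times in cmdline than declared.

    Hypothetical helper for illustration; it counts occurrences so that
    duplicated arguments such as "z=4" are checked individually.
    """
    want = Counter(declared)
    have = Counter(cmdline.split())
    missing = []
    for arg, count in want.items():
        shortfall = count - have.get(arg, 0)
        missing.extend([arg] * max(shortfall, 0))
    return missing


declared = ["y=0", "z=4", "y=1", "z=4"]
# Shortened tails of the two command lines quoted in this report.
actual_cmdline = "console=tty0 y=0 z=4 y=1"
expected_cmdline = "rootflags=prjquota y=0 z=4 y=1 z=4"

print(missing_kernel_args(declared, actual_cmdline))    # ['z=4']
print(missing_kernel_args(declared, expected_cmdline))  # []
```

Counting occurrences rather than testing set membership is the point of the sketch: a set-based check would not notice that the second "z=4" is gone.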
Additional info:
In multiple datacenter/zonal deployments, the csi driver seems to be crashing with - https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/139/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere-zones/1728081801684979712/artifacts/e2e-vsphere-zones/gather-extra/artifacts/pods/openshift-cluster-csi-drivers_vmware-vsphere-csi-driver-controller-574c8c86db-cs8gh_csi-driver.log
with error being:
{"level":"error","time":"2023-11-24T17:16:30.532383276Z","caller":"service/driver.go:203","msg":"failed to run the driver. Err: +failed to update cache with topology information. Error: failed to get vCenterInstance for vCenter Host: \"vcs8e-vc.ocp2.dev.cluster.com\". Error: virtual center was already registered","TraceId":"da5779b6-e99a-475b-b300-350dfa441f1e","stacktrace":"..."}
Link to failing build - https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_vsphere-problem-detector/139/pull-ci-openshift-vsphere-problem-detector-master-e2e-vsphere-zones/1728081801684979712
Description of problem:
OCPBUGS-29424 revealed that setting the node status update frequency in kubelet (introduced with OCPBUGS-15583) causes high control plane CPU usage. The reason is that the increased frequency of kubelet node status updates triggers second-order effects in all control plane operators that usually react to node changes (API server, etcd, PDB guard pod controllers, or any other static-pod-based machinery). Reverting the code in OCPBUGS-15583, or manually setting the report/status frequency to 0s, causes the CPU usage to drop immediately.
Version-Release number of selected component (if applicable):
Versions where OCPBUGS-15583 was backported. This includes 4.16, 4.15.0, 4.14.8, 4.13.33, and the next 4.12.z likely 4.12.51.
How reproducible:
always
Steps to Reproduce:
1. Create a cluster that contains a fix for OCPBUGS-15583.
2. Observe the apiserver metrics (e.g. rate(apiserver_request_total[5m])); these should show abnormal values for pod/configmap GET. Alternatively, check that the rate of node updates is increased (rate(apiserver_request_total{resource="nodes", subresource="status", verb="PATCH"}[1m])).
Actual results:
the node status updates every 10s, which causes high CPU usage on control plane operators and apiserver
Expected results:
the node status should not update that frequently, meaning the control plane CPU usage should go down again
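As a rough back-of-the-envelope illustration of why the update interval matters, the steady-state rate of node status PATCH requests scales with the node count divided by the interval. The sketch below assumes every node sends exactly one status update per interval; the node count and the 300s comparison interval are made-up values, not figures from this report.

```python
def node_status_patch_rate(node_count, update_interval_seconds):
    """Approximate node status PATCH requests per second across the cluster.

    Assumes each node sends exactly one status update per interval,
    which is a simplification for illustration.
    """
    if update_interval_seconds <= 0:
        raise ValueError("update interval must be positive")
    return node_count / update_interval_seconds


nodes = 120  # hypothetical cluster size
print(node_status_patch_rate(nodes, 10))   # 12.0 requests/s at a 10s interval
print(node_status_patch_rate(nodes, 300))  # 0.4 requests/s at a hypothetical 300s interval
```

Each of these requests is also a node change that controllers watching nodes may react to, which is the second-order effect described in the problem statement.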
Additional info:
slack thread with the node team: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1708429189987849
Description of problem:
Once a user selects a log component in the master node's logs section, the user cannot change or select a different log component from the dropdown.
To select a different log component, the user has to revisit the logs section under the master node, which refreshes the pane and resets it to the default options.
Version-Release number of selected components (if applicable):
4.11.0-0.nightly-2022-08-15-152346
How reproducible:
Always
Steps to Reproduce:
Actual results:
The user is unable to select or change the log component after already making a selection from the dropdown in the master node's logs section.
Expected results:
Users should be able to change or select the log component from the dropdown in the master node's logs section at any time.
Additional info:
Reproduced in both Chrome [103.0.5060.114 (Official Build) (64-bit)] and Firefox [91.11.0esr (64-bit)].
A screen capture is attached: ScreenRecorder_2022-08-16_26457662-aea5-4a00-aeb4-0fbddf8f16f0.mp4
Description of problem:
The kube-controller-manager pod in the openshift-kube-controller-manager namespace keeps reporting "failed to synchronize namespace" after the namespace has been deleted.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The namespace was deleted long ago, but the kube-controller-manager pod in the openshift-kube-controller-manager namespace still keeps reporting "failed to synchronize namespace".
Expected results:
It should not report this error for a deleted namespace.
Additional info:
Description of problem:
OSDOCS-7408 lists some commands to be removed from the documentation for MicroShift because they are not supported.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-22293. The following is the description of the original issue:
—
Description of problem:
Upgrading from 4.13.5 to 4.13.17 fails at network operator upgrade
Version-Release number of selected component (if applicable):
How reproducible:
Not sure since we only had one cluster on 4.13.5.
Steps to Reproduce:
1. Have a cluster on version 4.13.5 with OVN-Kubernetes.
2. Set the desired update image to quay.io/openshift-release-dev/ocp-release@sha256:c1f2fa2170c02869484a4e049132128e216a363634d38abf292eef181e93b692
3. Wait until the upgrade reaches the network operator.
Actual results:
Error message: Error while updating operator configuration: could not apply (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: failed to apply / update (apps/v1, Kind=DaemonSet) openshift-ovn-kubernetes/ovnkube-master: DaemonSet.apps "ovnkube-master" is invalid: [spec.template.spec.containers[1].lifecycle.preStop: Required value: must specify a handler type, spec.template.spec.containers[3].lifecycle.preStop: Required value: must specify a handler type]
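The validation error above points at preStop hooks that are present but specify no handler. A minimal sketch of how one might scan a rendered DaemonSet manifest (already loaded as a dictionary) for such hooks follows. The function name and the sample manifest are hypothetical, and the list of handler field names is an assumption based on the Kubernetes lifecycle handler fields; this is not the API server's own validation code.

```python
# Handler fields assumed for illustration; newer Kubernetes releases
# also define a "sleep" handler, which is included here for completeness.
HANDLER_TYPES = ("exec", "httpGet", "tcpSocket", "sleep")


def empty_prestop_hooks(workload):
    """Return field paths of containers whose preStop hook sets no handler."""
    pod_spec = workload.get("spec", {}).get("template", {}).get("spec", {})
    problems = []
    for index, container in enumerate(pod_spec.get("containers", [])):
        lifecycle = container.get("lifecycle") or {}
        if "preStop" not in lifecycle:
            continue  # no hook declared, nothing to validate
        handler = lifecycle["preStop"] or {}
        if not any(handler.get(name) is not None for name in HANDLER_TYPES):
            problems.append(
                f"spec.template.spec.containers[{index}].lifecycle.preStop"
            )
    return problems


# Hypothetical manifest shaped like the failing case: containers 1 and 3
# declare a preStop hook with no handler.
daemonset = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "c0"},
        {"name": "c1", "lifecycle": {"preStop": {}}},
        {"name": "c2", "lifecycle": {"preStop": {"exec": {"command": ["true"]}}}},
        {"name": "c3", "lifecycle": {"preStop": {}}},
    ]}}}
}
print(empty_prestop_hooks(daemonset))
```

Running such a check against the manifest before it is applied would surface the same two field paths that the error message reports.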
Expected results:
Network operator upgrades successfully
Additional info:
Since I'm not able to attach files please gather all required debug data from https://access.redhat.com/support/cases/#/case/03645170
This is a clone of issue OCPBUGS-23362. The following is the description of the original issue:
—
A hostedcluster/hostedcontrolplane was stuck uninstalling. The CPO logs showed:
"error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"
Unfortunately, I do not have enough access to the AWS account to inspect this security group, though I know it is the default worker security group because it is recorded in the hostedcluster's .status.platform.aws.defaultWorkerSecurityGroupID.
Version-Release number of selected component (if applicable):
4.14.1
How reproducible:
I haven't tried to reproduce it yet, but can do so and update this ticket when I do. My theory is:
Steps to Reproduce:
1. Create an AWS HostedCluster, wait for it to create/populate defaultWorkerSecurityGroupID
2. Attach the defaultWorkerSecurityGroupID to anything else in the AWS account unrelated to the HCP cluster
3. Attempt to delete the HostedCluster
Actual results:
CPO logs: "error": "failed to delete AWS default security group: failed to delete security group sg-04abe599e5567b025: DependencyViolation: resource sg-04abe599e5567b025 has a dependent object\n\tstatus code: 400, request id: f776a43f-8750-4f04-95ce-457659f59095"
HostedCluster Status Condition:
- lastTransitionTime: "2023-11-09T22:18:09Z"
  message: ""
  observedGeneration: 3
  reason: StatusUnknown
  status: Unknown
  type: CloudResourcesDestroyed
Expected results:
I would expect that the CloudResourcesDestroyed status condition on the hostedcluster would reflect this security group as holding up the deletion instead of having to parse through logs.
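The expectation above amounts to copying the deletion error into the condition instead of leaving the message empty. A minimal sketch of that idea follows; the function name and the reason strings "CloudResourcesDeletionBlocked" and "CloudResourcesDeleted" are hypothetical and do not come from HyperShift.

```python
def cloud_resources_destroyed_condition(delete_error, observed_generation):
    """Build a CloudResourcesDestroyed condition that surfaces the blocking error.

    The reason strings are hypothetical placeholders for illustration.
    """
    if delete_error:
        return {
            "type": "CloudResourcesDestroyed",
            "status": "False",
            "reason": "CloudResourcesDeletionBlocked",  # hypothetical reason
            "message": delete_error,
            "observedGeneration": observed_generation,
        }
    return {
        "type": "CloudResourcesDestroyed",
        "status": "True",
        "reason": "CloudResourcesDeleted",  # hypothetical reason
        "message": "",
        "observedGeneration": observed_generation,
    }


error = (
    "failed to delete AWS default security group: failed to delete security "
    "group sg-04abe599e5567b025: DependencyViolation: resource "
    "sg-04abe599e5567b025 has a dependent object"
)
condition = cloud_resources_destroyed_condition(error, 3)
print(condition["status"], condition["message"])
```

With a condition like this, the blocking security group would be visible from the HostedCluster status without reading the CPO logs.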
Additional info:
When creating a HostedCluster from the CLI with the KubeVirt platform and an external infra-cluster, creation fails with this message:
hypershift_framework.go:223: failed to create cluster, tearing down: failed to apply object "e2e-clusters-jqrxx/example-kk2sm": admission webhook "hostedclusters.hypershift.openshift.io" denied the request: Secret "example-kk2sm-infra-credentials" not found
The reason is that the HostedCluster CR is created before the kubeconfig secret of the external infra-cluster. The HostedCluster creation webhook tries to access the external infra-cluster and fails to find the secret, which has not been created yet.
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/142
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/90
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The capi machine cannot be deleted by the installer during cluster destroy. Checking the GCP console showed that this machine lacks the label (kubernetes-io-cluster-clusterid: owned). If this label is added manually on the GCP console, the installer can delete the machine during cluster destroy.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2022-10-05-053337
How reproducible:
Always
Steps to Reproduce:
1. Follow the steps at https://bugzilla.redhat.com/show_bug.cgi?id=2107999#c9 to create a capi machine:
liuhuali@Lius-MacBook-Pro huali-test % oc get machine.cluster.x-k8s.io -n openshift-cluster-api
NAME            CLUSTER            NODENAME   PROVIDERID                                                         PHASE         AGE   VERSION
capi-ms-mtchm   huliu-gcpx-c55vm              gce://openshift-qe/us-central1-a/capi-gcp-machine-template-gcw9t   Provisioned   51m
2. Destroy the cluster. The cluster was destroyed successfully, but the GCP console showed that the capi machine was still there.
labels of capi machine
labels of mapi machine
Actual results:
capi machine cannot be deleted by installer during cluster destroy
Expected results:
capi machine should be deleted by installer during cluster destroy
Additional info:
Also checked on AWS: the case worked well there, and the capi machines carry the tag (kubernetes.io/cluster/clusterid:owned).
ignition-server-proxy pods fail to start after y-stream upgrade because the deployment is configured with a ServiceAccount, set in 4.13, that was deleted in 4.14 in PR https://github.com/openshift/hypershift/pull/2778. The 4.14 reconciliation does not unset the ServiceAccount that was set in 4.13.
After I installed a "Git" Task from ArtifactHub directly in the Pipeline Builder and searched for a "git" Task again, the Pipeline Builder crashed.
Steps to reproduce:
Actual behaviour
Page crashes
Expected behaviour
Page should not crash
Additional information
Created/Imported Task:
apiVersion: tekton.dev/v1 kind: Task metadata: annotations: openshift.io/installed-from: ArtifactHub tekton.dev/categories: Git tekton.dev/displayName: git tekton.dev/pipelines.minVersion: 0.38.0 tekton.dev/platforms: 'linux/amd64,linux/s390x,linux/ppc64le,linux/arm64' tekton.dev/tags: git resourceVersion: '50218855' name: git uid: 1b88150a-f2c1-4030-9849-c7806c0745d8 creationTimestamp: '2023-11-28T10:54:51Z' generation: 1 labels: app.kubernetes.io/version: 0.1.0 spec: description: | This Task represents Git and is able to initialize and clone a remote repository on the informed Workspace. It's likely to become the first `step` on a Pipeline. params: - description: | Git repository URL. name: URL type: string - default: main description: | Revision to checkout, an branch, tag, sha, ref, etc... name: REVISION type: string - default: '' description: | Repository `refspec` to fetch before checking out the revision. name: REFSPEC type: string - default: 'true' description: | Initialize and fetch Git submodules. name: SUBMODULES type: string - default: '1' description: | Number of commits to fetch, a "shallow clone" is a single commit. name: DEPTH type: string - default: 'true' description: | Sets the global `http.sslVerify` value, `false` is not advised unless you trust the remote repository. name: SSL_VERIFY type: string - default: ca-bundle.crt description: | Certificate Authority (CA) bundle filename on the `ssl-ca-directory` Workspace. name: CRT_FILENAME type: string - default: '' description: | Relative path to the `output` Workspace where the repository will be cloned. name: SUBDIRECTORY type: string - default: '' description: | List of directory patterns split by comma to perform "sparse checkout". name: SPARSE_CHECKOUT_DIRECTORIES type: string - default: 'true' description: | Clean out the contents of the `output` Workspace before cloning the repository, if data exists. 
name: DELETE_EXISTING type: string - default: '' description: | HTTP proxy server (non-TLS requests). name: HTTP_PROXY type: string - default: '' description: | HTTPS proxy server (TLS requests). name: HTTPS_PROXY type: string - default: '' description: | Opt out of proxying HTTP/HTTPS requests. name: NO_PROXY type: string - default: 'false' description: | Log the commands executed. name: VERBOSE type: string - default: /home/git description: | Absolute path to the Git user home directory. name: USER_HOME type: string results: - description: | The precise commit SHA digest cloned. name: COMMIT type: string - description: | The precise repository URL. name: URL type: string - description: | The epoch timestamp of the commit cloned. name: COMMITTER_DATE type: string stepTemplate: computeResources: limits: cpu: 100m memory: 256Mi requests: cpu: 100m memory: 256Mi env: - name: PARAMS_URL value: $(params.URL) - name: PARAMS_REVISION value: $(params.REVISION) - name: PARAMS_REFSPEC value: $(params.REFSPEC) - name: PARAMS_SUBMODULES value: $(params.SUBMODULES) - name: PARAMS_DEPTH value: $(params.DEPTH) - name: PARAMS_SSL_VERIFY value: $(params.SSL_VERIFY) - name: PARAMS_CRT_FILENAME value: $(params.CRT_FILENAME) - name: PARAMS_SUBDIRECTORY value: $(params.SUBDIRECTORY) - name: PARAMS_SPARSE_CHECKOUT_DIRECTORIES value: $(params.SPARSE_CHECKOUT_DIRECTORIES) - name: PARAMS_DELETE_EXISTING value: $(params.DELETE_EXISTING) - name: PARAMS_HTTP_PROXY value: $(params.HTTP_PROXY) - name: PARAMS_HTTPS_PROXY value: $(params.HTTPS_PROXY) - name: PARAMS_NO_PROXY value: $(params.NO_PROXY) - name: PARAMS_VERBOSE value: $(params.VERBOSE) - name: PARAMS_USER_HOME value: $(params.USER_HOME) - name: WORKSPACES_OUTPUT_PATH value: $(workspaces.output.path) - name: WORKSPACES_SSH_DIRECTORY_BOUND value: $(workspaces.ssh-directory.bound) - name: WORKSPACES_SSH_DIRECTORY_PATH value: $(workspaces.ssh-directory.path) - name: WORKSPACES_BASIC_AUTH_BOUND value: $(workspaces.basic-auth.bound) - name: 
WORKSPACES_BASIC_AUTH_PATH value: $(workspaces.basic-auth.path) - name: WORKSPACES_SSL_CA_DIRECTORY_BOUND value: $(workspaces.ssl-ca-directory.bound) - name: WORKSPACES_SSL_CA_DIRECTORY_PATH value: $(workspaces.ssl-ca-directory.path) - name: RESULTS_COMMITTER_DATE_PATH value: $(results.COMMITTER_DATE.path) - name: RESULTS_COMMIT_PATH value: $(results.COMMIT.path) - name: RESULTS_URL_PATH value: $(results.URL.path) securityContext: runAsNonRoot: true runAsUser: 65532 steps: - computeResources: {} image: 'gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/git-init:latest' name: load-scripts script: | printf '%s' "IyEvdXNyL2Jpbi9lbnYgc2gKCmV4cG9ydCBQQVJBTVNfVVJMPSIke1BBUkFNU19VUkw6LX0iCmV4cG9ydCBQQVJBTVNfUkVWSVNJT049IiR7UEFSQU1TX1JFVklTSU9OOi19IgpleHBvcnQgUEFSQU1TX1JFRlNQRUM9IiR7UEFSQU1TX1JFRlNQRUM6LX0iCmV4cG9ydCBQQVJBTVNfU1VCTU9EVUxFUz0iJHtQQVJBTVNfU1VCTU9EVUxFUzotfSIKZXhwb3J0IFBBUkFNU19ERVBUSD0iJHtQQVJBTVNfREVQVEg6LX0iCmV4cG9ydCBQQVJBTVNfU1NMX1ZFUklGWT0iJHtQQVJBTVNfU1NMX1ZFUklGWTotfSIKZXhwb3J0IFBBUkFNU19DUlRfRklMRU5BTUU9IiR7UEFSQU1TX0NSVF9GSUxFTkFNRTotfSIKZXhwb3J0IFBBUkFNU19TVUJESVJFQ1RPUlk9IiR7UEFSQU1TX1NVQkRJUkVDVE9SWTotfSIKZXhwb3J0IFBBUkFNU19TUEFSU0VfQ0hFQ0tPVVRfRElSRUNUT1JJRVM9IiR7UEFSQU1TX1NQQVJTRV9DSEVDS09VVF9ESVJFQ1RPUklFUzotfSIKZXhwb3J0IFBBUkFNU19ERUxFVEVfRVhJU1RJTkc9IiR7UEFSQU1TX0RFTEVURV9FWElTVElORzotfSIKZXhwb3J0IFBBUkFNU19IVFRQX1BST1hZPSIke1BBUkFNU19IVFRQX1BST1hZOi19IgpleHBvcnQgUEFSQU1TX0hUVFBTX1BST1hZPSIke1BBUkFNU19IVFRQU19QUk9YWTotfSIKZXhwb3J0IFBBUkFNU19OT19QUk9YWT0iJHtQQVJBTVNfTk9fUFJPWFk6LX0iCmV4cG9ydCBQQVJBTVNfVkVSQk9TRT0iJHtQQVJBTVNfVkVSQk9TRTotfSIKZXhwb3J0IFBBUkFNU19VU0VSX0hPTUU9IiR7UEFSQU1TX1VTRVJfSE9NRTotfSIKCmV4cG9ydCBXT1JLU1BBQ0VTX09VVFBVVF9QQVRIPSIke1dPUktTUEFDRVNfT1VUUFVUX1BBVEg6LX0iCmV4cG9ydCBXT1JLU1BBQ0VTX1NTSF9ESVJFQ1RPUllfQk9VTkQ9IiR7V09SS1NQQUNFU19TU0hfRElSRUNUT1JZX0JPVU5EOi19IgpleHBvcnQgV09SS1NQQUNFU19TU0hfRElSRUNUT1JZX1BBVEg9IiR7V09SS1NQQUNFU19TU0hfRElSRUNUT1JZX1BBVEg6LX0iCmV4cG9ydCBXT1JLU1BBQ0VTX0JBU0lDX0FVVEhfQk9VTkQ9IiR7V09SS1NQ
QUNFU19CQVNJQ19BVVRIX0JPVU5EOi19IgpleHBvcnQgV09SS1NQQUNFU19CQVNJQ19BVVRIX1BBVEg9IiR7V09SS1NQQUNFU19CQVNJQ19BVVRIX1BBVEg6LX0iCmV4cG9ydCBXT1JLU1BBQ0VTX1NTTF9DQV9ESVJFQ1RPUllfQk9VTkQ9IiR7V09SS1NQQUNFU19TU0xfQ0FfRElSRUNUT1JZX0JPVU5EOi19IgpleHBvcnQgV09SS1NQQUNFU19TU0xfQ0FfRElSRUNUT1JZX1BBVEg9IiR7V09SS1NQQUNFU19TU0xfQ0FfRElSRUNUT1JZX1BBVEg6LX0iCgpleHBvcnQgUkVTVUxUU19DT01NSVRURVJfREFURV9QQVRIPSIke1JFU1VMVFNfQ09NTUlUVEVSX0RBVEVfUEFUSDotfSIKZXhwb3J0IFJFU1VMVFNfQ09NTUlUX1BBVEg9IiR7UkVTVUxUU19DT01NSVRfUEFUSDotfSIKZXhwb3J0IFJFU1VMVFNfVVJMX1BBVEg9IiR7UkVTVUxUU19VUkxfUEFUSDotfSIKCiMgZnVsbCBwYXRoIHRvIHRoZSBjaGVja291dCBkaXJlY3RvcnksIHVzaW5nIHRoZSBvdXRwdXQgd29ya3NwYWNlIGFuZCBzdWJkaXJlY3RvciBwYXJhbWV0ZXIKZXhwb3J0IGNoZWNrb3V0X2Rpcj0iJHtXT1JLU1BBQ0VTX09VVFBVVF9QQVRIfS8ke1BBUkFNU19TVUJESVJFQ1RPUll9IgoKIwojIEZ1bmN0aW9ucwojCgpmYWlsKCkgewogICAgZWNobyAiRVJST1I6ICR7QH0iIDE+JjIKICAgIGV4aXQgMQp9CgpwaGFzZSgpIHsKICAgIGVjaG8gIi0tLT4gUGhhc2U6ICR7QH0uLi4iCn0KCiMgSW5zcGVjdCB0aGUgZW52aXJvbm1lbnQgdmFyaWFibGVzIHRvIGFzc2VydCB0aGUgbWluaW11bSBjb25maWd1cmF0aW9uIGlzIGluZm9ybWVkLgphc3NlcnRfcmVxdWlyZWRfY29uZmlndXJhdGlvbl9vcl9mYWlsKCkgewogICAgW1sgLXogIiR7UEFSQU1TX1VSTH0iIF1dICYmCiAgICAgICAgZmFpbCAiUGFyYW1ldGVyIFVSTCBpcyBub3Qgc2V0ISIKCiAgICBbWyAteiAiJHtXT1JLU1BBQ0VTX09VVFBVVF9QQVRIfSIgXV0gJiYKICAgICAgICBmYWlsICJPdXRwdXQgV29ya3NwYWNlIGlzIG5vdCBzZXQhIgoKICAgIFtbICEgLWQgIiR7V09SS1NQQUNFU19PVVRQVVRfUEFUSH0iIF1dICYmCiAgICAgICAgZmFpbCAiT3V0cHV0IFdvcmtzcGFjZSBkaXJlY3RvcnkgJyR7V09SS1NQQUNFU19PVVRQVVRfUEFUSH0nIG5vdCBmb3VuZCEiCgogICAgcmV0dXJuIDAKfQoKIyBDb3B5IHRoZSBmaWxlIGludG8gdGhlIGRlc3RpbmF0aW9uLCBjaGVja2luZyBpZiB0aGUgc291cmNlIGV4aXN0cy4KY29weV9vcl9mYWlsKCkgewogICAgbG9jYWwgX21vZGU9IiR7MX0iCiAgICBsb2NhbCBfc3JjPSIkezJ9IgogICAgbG9jYWwgX2RzdD0iJHszfSIKCiAgICBpZiBbWyAhIC1mICIke19zcmN9IiAmJiAhIC1kICIke19zcmN9IiBdXTsgdGhlbgogICAgICAgIGZhaWwgIlNvdXJjZSBmaWxlL2RpcmVjdG9yeSBpcyBub3QgZm91bmQgYXQgJyR7X3NyY30nIgogICAgZmkKCiAgICBpZiBbWyAtZCAiJHtfc3JjfSIgXV07IHRoZW4KICAgICAgICBjcCAtUnYgJHtfc3JjfSAke19kc3R9CiAgICAgICAgY2htb2QgLXYgJHtf
bW9kZX0gJHtfZHN0fQogICAgZWxzZQogICAgICAgIGluc3RhbGwgLS12ZXJib3NlIC0tbW9kZT0ke19tb2RlfSAke19zcmN9ICR7X2RzdH0KICAgIGZpCn0KCiMgRGVsZXRlIGFueSBleGlzdGluZyBjb250ZW50cyBvZiB0aGUgcmVwbyBkaXJlY3RvcnkgaWYgaXQgZXhpc3RzLiBXZSBkb24ndCBqdXN0ICJybSAtcmYgPGRpcj4iCiMgYmVjYXVzZSBtaWdodCBiZSAiLyIgb3IgdGhlIHJvb3Qgb2YgYSBtb3VudGVkIHZvbHVtZS4KY2xlYW5fZGlyKCkgewogICAgbG9jYWwgX2Rpcj0iJHsxfSIKCiAgICBbWyAhIC1kICIke19kaXJ9IiBdXSAmJgogICAgICAgIHJldHVybiAwCgogICAgIyBEZWxldGUgbm9uLWhpZGRlbiBmaWxlcyBhbmQgZGlyZWN0b3JpZXMKICAgIHJtIC1yZnYgJHtfZGlyOj99LyoKICAgICMgRGVsZXRlIGZpbGVzIGFuZCBkaXJlY3RvcmllcyBzdGFydGluZyB3aXRoIC4gYnV0IGV4Y2x1ZGluZyAuLgogICAgcm0gLXJmdiAke19kaXJ9Ly5bIS5dKgogICAgIyBEZWxldGUgZmlsZXMgYW5kIGRpcmVjdG9yaWVzIHN0YXJ0aW5nIHdpdGggLi4gcGx1cyBhbnkgb3RoZXIgY2hhcmFjdGVyCiAgICBybSAtcmZ2ICR7X2Rpcn0vLi4/Kgp9CgojCiMgU2V0dGluZ3MKIwoKIyB3aGVuIHRoZSBrby1hcHAgZGlyZWN0b3J5IGlzIHByZXNlbnQsIG1ha2luZyBzdXJlIGl0J3MgcGFydCBvZiB0aGUgUEFUSApbWyAtZCAiL2tvLWFwcCIgXV0gJiYgZXhwb3J0IFBBVEg9IiR7UEFUSH06L2tvLWFwcCIKCiMgbWFraW5nIHRoZSBzaGVsbCB2ZXJib3NlIHdoZW4gdGhlIHBhcmFtdGVyIGlzIHNldApbWyAiJHtQQVJBTVNfVkVSQk9TRX0iID09ICJ0cnVlIiBdXSAmJiBzZXQgLXgKCnJldHVybiAw" |base64 -d >common.sh chmod +x "common.sh" printf '%s' 
"IyEvdXNyL2Jpbi9lbnYgc2gKIwojIEV4cG9ydHMgcHJveHkgYW5kIGN1c3RvbSBTU0wgQ0EgY2VydGlmaWNhdHMgaW4gdGhlIGVudmlyb21lbnQgYW5kIHJ1bnMgdGhlIGdpdC1pbml0IHdpdGggZmxhZ3MKIyBiYXNlZCBvbiB0aGUgdGFzayBwYXJhbWV0ZXJzLgojCgpzZXQgLWV1Cgpzb3VyY2UgJChDRFBBVEg9IGNkIC0tICIkKGRpcm5hbWUgLS0gJHswfSkiICYmIHB3ZCkvY29tbW9uLnNoCgphc3NlcnRfcmVxdWlyZWRfY29uZmlndXJhdGlvbl9vcl9mYWlsCgojCiMgQ0EgKGBzc2wtY2EtZGlyZWN0b3J5YCBXb3Jrc3BhY2UpCiMKCmlmIFtbICIke1dPUktTUEFDRVNfU1NMX0NBX0RJUkVDVE9SWV9CT1VORH0iID09ICJ0cnVlIiAmJiAtbiAiJHtQQVJBTVNfQ1JUX0ZJTEVOQU1FfSIgXV07IHRoZW4KCXBoYXNlICJJbnNwZWN0aW5nICdzc2wtY2EtZGlyZWN0b3J5JyB3b3Jrc3BhY2UgbG9va2luZyBmb3IgJyR7UEFSQU1TX0NSVF9GSUxFTkFNRX0nIGZpbGUiCgljcnQ9IiR7V09SS1NQQUNFU19TU0xfQ0FfRElSRUNUT1JZX1BBVEh9LyR7UEFSQU1TX0NSVF9GSUxFTkFNRX0iCglbWyAhIC1mICIke2NydH0iIF1dICYmCgkJZmFpbCAiQ1JUIGZpbGUgKFBBUkFNU19DUlRfRklMRU5BTUUpIG5vdCBmb3VuZCBhdCAnJHtjcnR9JyIKCglwaGFzZSAiRXhwb3J0aW5nIGN1c3RvbSBDQSBjZXJ0aWZpY2F0ZSAnR0lUX1NTTF9DQUlORk89JHtjcnR9JyIKCWV4cG9ydCBHSVRfU1NMX0NBSU5GTz0ke2NydH0KZmkKCiMKIyBQcm94eSBTZXR0aW5ncwojCgpwaGFzZSAiU2V0dGluZyB1cCBIVFRQX1BST1hZPScke1BBUkFNU19IVFRQX1BST1hZfSciCltbIC1uICIke1BBUkFNU19IVFRQX1BST1hZfSIgXV0gJiYgZXhwb3J0IEhUVFBfUFJPWFk9IiR7UEFSQU1TX0hUVFBfUFJPWFl9IgoKcGhhc2UgIlNldHR0aW5nIHVwIEhUVFBTX1BST1hZPScke1BBUkFNU19IVFRQU19QUk9YWX0nIgpbWyAtbiAiJHtQQVJBTVNfSFRUUFNfUFJPWFl9IiBdXSAmJiBleHBvcnQgSFRUUFNfUFJPWFk9IiR7UEFSQU1TX0hUVFBTX1BST1hZfSIKCnBoYXNlICJTZXR0aW5nIHVwIE5PX1BST1hZPScke1BBUkFNU19OT19QUk9YWX0nIgpbWyAtbiAiJHtQQVJBTVNfTk9fUFJPWFl9IiBdXSAmJiBleHBvcnQgTk9fUFJPWFk9IiR7UEFSQU1TX05PX1BST1hZfSIKCiMKIyBHaXQgQ2xvbmUKIwoKcGhhc2UgIlNldHRpbmcgb3V0cHV0IHdvcmtzcGFjZSBhcyBzYWZlIGRpcmVjdG9yeSAoJyR7V09SS1NQQUNFU19PVVRQVVRfUEFUSH0nKSIKZ2l0IGNvbmZpZyAtLWdsb2JhbCAtLWFkZCBzYWZlLmRpcmVjdG9yeSAiJHtXT1JLU1BBQ0VTX09VVFBVVF9QQVRIfSIKCnBoYXNlICJDbG9uaW5nICcke1BBUkFNU19VUkx9JyBpbnRvICcke2NoZWNrb3V0X2Rpcn0nIgpzZXQgLXgKZXhlYyBnaXQtaW5pdCBcCgktdXJsPSIke1BBUkFNU19VUkx9IiBcCgktcmV2aXNpb249IiR7UEFSQU1TX1JFVklTSU9OfSIgXAoJLXJlZnNwZWM9IiR7UEFSQU1TX1JFRlNQRUN9IiBcCgktcGF0aD0iJHtjaGV
ja291dF9kaXJ9IiBcCgktc3NsVmVyaWZ5PSIke1BBUkFNU19TU0xfVkVSSUZZfSIgXAoJLXN1Ym1vZHVsZXM9IiR7UEFSQU1TX1NVQk1PRFVMRVN9IiBcCgktZGVwdGg9IiR7UEFSQU1TX0RFUFRIfSIgXAoJLXNwYXJzZUNoZWNrb3V0RGlyZWN0b3JpZXM9IiR7UEFSQU1TX1NQQVJTRV9DSEVDS09VVF9ESVJFQ1RPUklFU30iCg==" |base64 -d >git-clone.sh chmod +x "git-clone.sh" printf '%s' "IyEvdXNyL2Jpbi9lbnYgc2gKIwojIFNldHMgdXAgdGhlIGJhc2ljIGFuZCBTU0ggYXV0aGVudGljYXRpb24gYmFzZWQgb24gaW5mb3JtZWQgd29ya3NwYWNlcywgYXMgd2VsbCBhcyBjbGVhbmluZyB1cCB0aGUKIyBwcmV2aW91cyBnaXQtY2xvbmUgc3RhbGUgZGF0YS4KIwoKc2V0IC1ldQoKc291cmNlICQoQ0RQQVRIPSBjZCAtLSAiJChkaXJuYW1lIC0tICR7MH0pIiAmJiBwd2QpL2NvbW1vbi5zaAoKYXNzZXJ0X3JlcXVpcmVkX2NvbmZpZ3VyYXRpb25fb3JfZmFpbAoKcGhhc2UgIlByZXBhcmluZyB0aGUgZmlsZXN5c3RlbSBiZWZvcmUgY2xvbmluZyB0aGUgcmVwb3NpdG9yeSIKCmlmIFtbICIke1dPUktTUEFDRVNfQkFTSUNfQVVUSF9CT1VORH0iID09ICJ0cnVlIiBdXTsgdGhlbgoJcGhhc2UgIkNvbmZpZ3VyaW5nIEdpdCBhdXRoZW50aWNhdGlvbiB3aXRoICdiYXNpYy1hdXRoJyBXb3Jrc3BhY2UgZmlsZXMiCgoJZm9yIGYgaW4gLmdpdC1jcmVkZW50aWFscyAuZ2l0Y29uZmlnOyBkbwoJCXNyYz0iJHtXT1JLU1BBQ0VTX0JBU0lDX0FVVEhfUEFUSH0vJHtmfSIKCQlwaGFzZSAiQ29weWluZyAnJHtzcmN9JyB0byAnJHtQQVJBTVNfVVNFUl9IT01FfSciCgkJY29weV9vcl9mYWlsIDQwMCAke3NyY30gIiR7UEFSQU1TX1VTRVJfSE9NRX0vIgoJZG9uZQpmaQoKaWYgW1sgIiR7V09SS1NQQUNFU19TU0hfRElSRUNUT1JZX0JPVU5EfSIgPT0gInRydWUiIF1dOyB0aGVuCglwaGFzZSAiQ29weWluZyAnLnNzaCcgZnJvbSBzc2gtZGlyZWN0b3J5IHdvcmtzcGFjZSAoJyR7V09SS1NQQUNFU19TU0hfRElSRUNUT1JZX1BBVEh9JykiCgoJZG90X3NzaD0iJHtQQVJBTVNfVVNFUl9IT01FfS8uc3NoIgoJY29weV9vcl9mYWlsIDcwMCAke1dPUktTUEFDRVNfU1NIX0RJUkVDVE9SWV9QQVRIfSAke2RvdF9zc2h9CgljaG1vZCAtUnYgNDAwICR7ZG90X3NzaH0vKgpmaQoKaWYgW1sgIiR7UEFSQU1TX0RFTEVURV9FWElTVElOR30iID09ICJ0cnVlIiBdXTsgdGhlbgoJcGhhc2UgIkRlbGV0aW5nIGFsbCBjb250ZW50cyBvZiBjaGVja291dC1kaXIgJyR7Y2hlY2tvdXRfZGlyfSciCgljbGVhbl9kaXIgJHtjaGVja291dF9kaXJ9IHx8IHRydWUKZmkKCmV4aXQgMA==" |base64 -d >prepare.sh chmod +x "prepare.sh" printf '%s' 
"IyEvdXNyL2Jpbi9lbnYgc2gKIwojIFNjYW4gdGhlIGNsb25lZCByZXBvc2l0b3J5IGluIG9yZGVyIHRvIHJlcG9ydCBkZXRhaWxzIHdyaXR0aW5nIHRoZSByZXN1bHQgZmlsZXMuCiMKCnNldCAtZXUKCnNvdXJjZSAkKENEUEFUSD0gY2QgLS0gIiQoZGlybmFtZSAtLSAkezB9KSIgJiYgcHdkKS9jb21tb24uc2gKCmFzc2VydF9yZXF1aXJlZF9jb25maWd1cmF0aW9uX29yX2ZhaWwKCnBoYXNlICJDb2xsZWN0aW5nIGNsb25lZCByZXBvc2l0b3J5IGluZm9ybWF0aW9uICgnJHtjaGVja291dF9kaXJ9JykiCgpjZCAiJHtjaGVja291dF9kaXJ9IiB8fCBmYWlsICJOb3QgYWJsZSB0byBlbnRlciBjaGVja291dC1kaXIgJyR7Y2hlY2tvdXRfZGlyfSciCgpwaGFzZSAiU2V0dGluZyBvdXRwdXQgd29ya3NwYWNlIGFzIHNhZmUgZGlyZWN0b3J5ICgnJHtXT1JLU1BBQ0VTX09VVFBVVF9QQVRIfScpIgpnaXQgY29uZmlnIC0tZ2xvYmFsIC0tYWRkIHNhZmUuZGlyZWN0b3J5ICIke1dPUktTUEFDRVNfT1VUUFVUX1BBVEh9IgoKcmVzdWx0X3NoYT0iJChnaXQgcmV2LXBhcnNlIEhFQUQpIgpyZXN1bHRfY29tbWl0dGVyX2RhdGU9IiQoZ2l0IGxvZyAtMSAtLXByZXR0eT0lY3QpIgoKcGhhc2UgIlJlcG9ydGluZyBsYXN0IGNvbW1pdCBkYXRlICcke3Jlc3VsdF9jb21taXR0ZXJfZGF0ZX0nIgpwcmludGYgIiVzIiAiJHtyZXN1bHRfY29tbWl0dGVyX2RhdGV9IiA+JHtSRVNVTFRTX0NPTU1JVFRFUl9EQVRFX1BBVEh9CgpwaGFzZSAiUmVwb3J0aW5nIHBhcnNlZCByZXZpc2lvbiBTSEEgJyR7cmVzdWx0X3NoYX0nIgpwcmludGYgIiVzIiAiJHtyZXN1bHRfc2hhfSIgPiR7UkVTVUxUU19DT01NSVRfUEFUSH0KCnBoYXNlICJSZXBvcnRpbmcgcmVwb3NpdG9yeSBVUkwgJyR7UEFSQU1TX1VSTH0nIgpwcmludGYgIiVzIiAiJHtQQVJBTVNfVVJMfSIgPiR7UkVTVUxUU19VUkxfUEFUSH0KCmV4aXQgMA==" |base64 -d >report.sh chmod +x "report.sh" volumeMounts: - mountPath: /scripts name: scripts-dir workingDir: /scripts - command: - /scripts/prepare.sh computeResources: {} image: 'gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/git-init:latest' name: prepare volumeMounts: - mountPath: /scripts name: scripts-dir - mountPath: $(params.USER_HOME) name: user-home - command: - /scripts/git-clone.sh computeResources: {} image: 'gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/git-init:latest' name: git-clone volumeMounts: - mountPath: /scripts name: scripts-dir - mountPath: $(params.USER_HOME) name: user-home - command: - /scripts/report.sh computeResources: {} image: 
'gcr.io/tekton-releases/github.com/tektoncd/pipeline/cmd/git-init:latest' name: report volumeMounts: - mountPath: /scripts name: scripts-dir volumes: - emptyDir: {} name: user-home - emptyDir: {} name: scripts-dir workspaces: - description: | The Git repository directory, data will be placed on the root of the Workspace, or on the relative path defined by the SUBDIRECTORY parameter. name: output - description: | A `.ssh` directory with private key, `known_hosts`, `config`, etc. Copied to the Git user's home before cloning the repository, in order to server as authentication mechanismBinding a Secret to this Workspace is strongly recommended over other volume types. name: ssh-directory optional: true - description: | A Workspace containing a `.gitconfig` and `.git-credentials` files. These will be copied to the user's home before Git commands run. All other files in this Workspace are ignored. It is strongly recommended to use `ssh-directory` over `basic-auth` whenever possible, and to bind a Secret to this Workspace over other volume types. name: basic-auth optional: true - description: | A Workspace containing CA certificates, this will be used by Git to verify the peer with when interacting with remote repositories using HTTPS. name: ssl-ca-directory optional: true
Description of problem:
OLMv0 over-uses listers and consumes too much memory. Also, $GOMEMLIMIT is not used and the runtime overcommits on RSS. See the following doc for more detail: https://docs.google.com/document/d/11J7lv1HtEq_c3l6fLTWfsom8v1-7guuG4DziNQDU6cY/edit#heading=h.ttj9tfltxgzt
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
IPI installation using the service account attached to a GCP VM always fails with the error "unable to parse credentials".
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-15-233408
How reproducible:
Always
Steps to Reproduce:
1. "create install-config"
2. Edit install-config.yaml to insert "credentialsMode: Manual"
3. "create manifests"
4. Manually create the required credentials and copy the manifests to the installation-dir/manifests directory
5. Launch the bastion host, binding it to the pre-configured service account ipi-on-bastion-sa@openshift-qe.iam.gserviceaccount.com with the scope "cloud-platform"
6. Copy the installation-dir and openshift-install to the bastion host
7. Try "create cluster" on the bastion host
Actual results:
The installation failed on "Creating infrastructure resources"
Expected results:
The installation should succeed.
Additional info:
(1) FYI the 4.12 epic: https://issues.redhat.com/browse/CORS-2260
(2) 4.12.34 doesn't have the issue (Flexy-install/234112/).
(3) 4.13.13 doesn't have the issue (Flexy-install/234126/).
(4) The 4.14 errors (Flexy-install/234113/):
09-19 16:13:44.919 level=info msg=Consuming Master Ignition Config from target directory
09-19 16:13:44.919 level=info msg=Consuming Bootstrap Ignition Config from target directory
09-19 16:13:44.919 level=info msg=Consuming Worker Ignition Config from target directory
09-19 16:13:44.919 level=info msg=Credentials loaded from gcloud CLI defaults
09-19 16:13:49.071 level=info msg=Creating infrastructure resources...
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg=Error: unable to parse credentials
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg= with provider["openshift/local/google"],
09-19 16:13:50.950 level=error msg= on main.tf line 10, in provider "google":
09-19 16:13:50.950 level=error msg= 10: provider "google" {
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg=unexpected end of JSON input
09-19 16:13:50.950 level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "cluster" stage: failed to create cluster: failed to apply Terraform: exit status 1
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg=Error: unable to parse credentials
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg= with provider["openshift/local/google"],
09-19 16:13:50.950 level=error msg= on main.tf line 10, in provider "google":
09-19 16:13:50.950 level=error msg= 10: provider "google" {
09-19 16:13:50.950 level=error
09-19 16:13:50.950 level=error msg=unexpected end of JSON input
09-19 16:13:50.950 level=error
Testcases:
1. Create a configmap from a file with 77 characters in a line
File data: tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt CLI data: $ oc get cm cm-test4 -o yaml apiVersion: v1 data: cm-test4: | ##Noticed the Literal style tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt kind: ConfigMap metadata: creationTimestamp: "2022-09-28T12:39:43Z" name: cm-test4 namespace: configmap-test resourceVersion: "8962738" uid: cf0e264b-72fb-4df7-bd3a-f3ed62423367 UI data: kind: ConfigMap apiVersion: v1 metadata: name: cm-test4 namespace: configmap-test uid: cf0e264b-72fb-4df7-bd3a-f3ed62423367 resourceVersion: '8962738' creationTimestamp: '2022-09-28T12:39:43Z' managedFields: - manager: kubectl-create operation: Update apiVersion: v1 time: '2022-09-28T12:39:43Z' fieldsType: FieldsV1 fieldsV1: 'f:data': .: {} 'f:cm-test4': {} data: cm-test4: | ##Noticed the Literal style tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
2. Create a configmap from a file with more than 78 characters in a line
File Data: tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt CLI Data: $ oc get cm cm-test5 -o yaml apiVersion: v1 data: cm-test5: | ##Noticed the Literal style tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt kind: ConfigMap metadata: creationTimestamp: "2022-09-28T12:39:54Z" name: cm-test5 namespace: configmap-test resourceVersion: "8962813" uid: b8b12653-588a-4afc-8ed9-ff7c6ebaefb1 UI data: kind: ConfigMap apiVersion: v1 metadata: name: cm-test5 namespace: configmap-test uid: b8b12653-588a-4afc-8ed9-ff7c6ebaefb1 resourceVersion: '8962813' creationTimestamp: '2022-09-28T12:39:54Z' managedFields: - manager: kubectl-create operation: Update apiVersion: v1 time: '2022-09-28T12:39:54Z' fieldsType: FieldsV1 fieldsV1: 'f:data': .: {} 'f:cm-test5': {} data: cm-test5: > ##Noticed the Folded style and newlines in between data tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee ssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt
Conclusion:
When the ConfigMap is created with more than 78 characters in a single line, the YAML editor in the web UI changes the block style from literal to folded, and newlines appear in between the data lines.
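The threshold observed in the two test cases can be illustrated with a small sketch. It assumes the editor's serializer chooses the folded style whenever a line exceeds a configured width minus the indentation; the function name `choose_block_style` and the defaults of 80 columns and 2 spaces are hypothetical and are not confirmed as the console's actual implementation.

```python
def choose_block_style(text: str, line_width: int = 80, indent: int = 2) -> str:
    """Return '|' (literal) or '>' (folded) for a multi-line string value.

    Assumption: a line is considered foldable when it is longer than the
    configured width minus the indentation of the value.
    """
    effective_width = line_width - indent  # 78 with the assumed defaults
    has_long_line = any(len(line) > effective_width for line in text.splitlines())
    return ">" if has_long_line else "|"


# Four lines of 77 characters each, as in test case 1.
short_text = "\n".join(ch * 77 for ch in "test")
# Four lines of 79 characters each, which is more than 78 as in test case 2.
long_text = "\n".join(ch * 79 for ch in "test")

print(choose_block_style(short_text))  # literal style
print(choose_block_style(long_text))   # folded style
```

Under these assumptions a line of exactly 78 characters would still be written in literal style, which is consistent with the conclusion above.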
Issues 45 and 55 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
When an operator-backed resource is created with a service binding, the application group visual does not show up.
Note: Is this really PF5-related, or does this issue exist already on 4.14?
Screenshot: https://drive.google.com/drive/u/1/folders/1OKeJ8PPGZi-1QyqQ184xQznmqii37NNB
The e2e-aws-serial-techpreview lane under openshift/api is failing:
shared-resource-csi-driver-operator fails with:
failed to list *v1.APIServer: apiservers.config.openshift.io is forbidden: User "system:serviceaccount:openshift-cluster-csi-drivers:shared-resource-csi-driver-operator" cannot list resource "apiservers" in API group "config.openshift.io" at the cluster scope
This is a clone of issue OCPBUGS-30580. The following is the description of the original issue:
—
Description of problem:
KAS labels on projects created should be consistent with OCP - enforce: privileged
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Steps to Reproduce:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Actual results:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Expected results:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Additional info:
See https://issues.redhat.com/browse/OCPBUGS-20526.
Fix jq command in local cmo run
Description of problem:
I have a customer with OCP 4.9 who has configured the IngressControllers with a long httpLogFormat, and the routers print the following warnings every time HAProxy reloads:
I0927 13:29:45.495077 1 router.go:612] template "msg"="router reloaded" "output"="[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'public'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_sni'.\n[WARNING] 269/132945 (9167) : config : truncating capture length to 63 bytes for frontend 'fe_no_sni'.\n - Checking http://localhost:80 ...\n - Health check ok : 0 retry attempt(s).\n"
This is the Ingress Controller configuration:
logging:
  access:
    destination:
      syslog:
        address: 10.X.X.X
        port: 10514
      type: Syslog
    httpCaptureCookies:
    - matchType: Exact
      maxLength: 128
      name: ITXSESSIONID
    httpCaptureHeaders:
      request:
      - maxLength: 128
        name: Host
      - maxLength: 128
        name: itxrequestid
    httpLogFormat: actconn="%ac",backend_name="%b",backend_queue="%bq",backend_source_ip="%bi",backend_source_port="%bp",beconn="%bc",bytes_read="%B",bytes_uploaded="%U",captrd_req_cookie="%CC",captrd_req_headers="%hr",captrd_res_cookie="%CS",captrd_res_headers="%hs",client_ip="%ci",client_port="%cp",cluster="ieec1ocp1",datacenter="ieec1",environment="pro",fe_name_transport="%ft",feconn="%fc",frontend_name="%f",hostname="%H",http_version="%HV",log_type="http",method="%HM",query_string="%HQ",req_date="%tr",request="%HP",res_time="%TR",retries="%rc",server_ip="%si",server_name="%s",server_port="%sp",srv_queue="%sq",srv_conn="%sc",srv_queue="%sq",status_code="%ST",Ta="%Ta",Tc="%Tc",tenant="bk",term_state="%tsc",tot_wait_q="%Tw",Tr="%Tr"
    logEmptyRequests: Ignore
Is there any way to avoid this truncation warning?
How reproducible:
For every reload of haproxy config
Steps to Reproduce:
You can reproduce easily with the following configuration in the default ingress controller:
logging:
  access:
    destination:
      type: Container
    httpCaptureCookies:
2022-10-18T14:13:53.068164+00:00 xxxx xxxxxx haproxy[38]: 10.39.192.203:40698 [18/Oct/2022:14:13:52.488] fe_sni~ be_secure:openshift-console:console/pod:console-5976495467-zxgxr:console:https:10.128.1.116:8443 0/0/0/10/580 200 1130598 _abck=B7EA642C9E828FA8210F329F80B7B2D80YAAQnVozuFVfkOaDAQAADk - --VN 78/37/33/33/0 0/0 "GET /api/kubernetes/openapi/v2 HTTP/1.1"
In spyglass charts rows sometimes require an additional field added to the locator to make things appear on separate lines. (node state is a great example where we need os update, phases, and notready, all on separate lines, otherwise they would overlap and we wouldn't be able to see anything). This will also be useful for pod logs and similar.
Our goal is origin being able to add new intervals, without requiring an update to the js (which will be in sippy) to get things to display properly. We need a way to differentiate structured intervals into separate rows within the same group.
Leaning towards row/foo in the locator, as this value for each row is the locator.
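A minimal sketch of the row/foo idea follows. It assumes locators are space-separated key/value tokens and that each distinct locator string becomes its own chart row; the interval dictionaries, the `group_rows` helper, and the sample node name are hypothetical.

```python
from collections import defaultdict


def group_rows(intervals):
    """Group interval messages by their full locator string.

    Assumption: the chart draws one row per distinct locator, so adding a
    row/<name> token is enough to split intervals for the same node.
    """
    rows = defaultdict(list)
    for interval in intervals:
        rows[interval["locator"]].append(interval["message"])
    return dict(rows)


# Without a row/ token, all node intervals share one locator and overlap.
overlapping = [
    {"locator": "node/worker-0", "message": "os update"},
    {"locator": "node/worker-0", "message": "phase: drain"},
    {"locator": "node/worker-0", "message": "not ready"},
]

# With a row/ token, each kind of interval lands on a separate row.
separated = [
    {"locator": "node/worker-0 row/os-update", "message": "os update"},
    {"locator": "node/worker-0 row/phases", "message": "phase: drain"},
    {"locator": "node/worker-0 row/notready", "message": "not ready"},
]

print(len(group_rows(overlapping)))
print(len(group_rows(separated)))
```

Because the grouping key is the locator itself, new interval types added in origin would get their own rows without any change to the display code.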
Description of problem:
Upon installing 4.14.0-rc.6 in a cluster with private load balancer publishing and existing vnets, Services of type LoadBalancer lack the permissions necessary to sync.
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
Seemingly 100%
Steps to Reproduce:
1. Install with Azure Managed Identity into an existing vnet with private LB publishing
Actual results:
One or more other status conditions indicate a degraded state: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 403, RawError: {"error":{"code":"AuthorizationFailed","message":"The client '194d5669-cb47-4199-a673-4b32a4a110be' with object id '194d5669-cb47-4199-a673-4b32a4a110be' does not have authorization to perform action 'Microsoft.Network/virtualNetworks/subnets/read' over scope '/subscriptions/14b86a40-8d8f-4e69-abaf-42cbb0b8a331/resourceGroups/net/providers/Microsoft.Network/virtualNetworks/rnd-we-net/subnets/paas1' or the scope is invalid. If access was recently granted, please refresh your credentials."}}
Operators dependent on Ingress are failing as well.
authentication 4.14.0-rc.6 False False True 149m OAuthServerRouteEndpointAccessibleControllerAvailable: Get https://oauth-openshift.apps.cnb10161.rnd.westeurope.example.com/healthz: dial tcp: lookup oauth-openshift.apps.cnb10161.rnd.westeurope.example.com on 10.224.0.10:53: no such host (this is likely result of malfunctioning DNS server)
console 4.14.0-rc.6 False True False 142m DeploymentAvailable: 0 replicas available for console deployment...
Expected results:
Successful install
Additional info:
The client ID in the error corresponds to "openshift-cloud-controller-manager-azure-cloud-credentials". Checking its Azure managed identity confirms that it only has access to the cluster resource group and not the network resource group. Additionally, the reporter notes that this permission is granted to the MAPI roles but not to the CCM roles.
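The gap described above can be checked mechanically by comparing the action from the error message against a role's allowed actions. The sketch below is illustrative only: the action lists for the two roles are hypothetical placeholders, not the real role definitions, and `role_allows` is an invented helper that treats `*` as a wildcard.

```python
from fnmatch import fnmatchcase

# Action named in the AuthorizationFailed error above.
REQUIRED_ACTION = "Microsoft.Network/virtualNetworks/subnets/read"


def role_allows(actions, required):
    """Return True if any allowed action pattern matches the required action.

    Both sides are lowercased so that the comparison ignores case.
    """
    required = required.lower()
    return any(fnmatchcase(required, pattern.lower()) for pattern in actions)


# Hypothetical action lists, for illustration only.
ccm_actions = ["Microsoft.Compute/virtualMachines/read"]
mapi_actions = ["Microsoft.Network/virtualNetworks/subnets/read"]

print(role_allows(ccm_actions, REQUIRED_ACTION))
print(role_allows(mapi_actions, REQUIRED_ACTION))
```

A check like this would also need to be evaluated at the scope of the network resource group, since the identity in this report only had access to the cluster resource group.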
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/77
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-28625. The following is the description of the original issue:
—
Description of problem:
HCP does not honor the oauthMetadata field of hc.spec.configuration.authentication, causing the console to crash and oc login to fail.
Version-Release number of selected component (if applicable):
HyperShift management cluster: 4.16.0-0.nightly-2024-01-29-233218
HyperShift hosted cluster: 4.16.0-0.nightly-2024-01-29-233218
How reproducible:
Always
Steps to Reproduce:
1. Install HCP env. Export KUBECONFIG:
$ export KUBECONFIG=/path/to/hosted-cluster/kubeconfig
2. Create keycloak applications. Then get the route:
$ KEYCLOAK_HOST=https://$(oc get -n keycloak route keycloak --template='{{ .spec.host }}')
$ echo $KEYCLOAK_HOST
https://keycloak-keycloak.apps.hypershift-ci-18556.xxx
$ curl -sSk "$KEYCLOAK_HOST/realms/master/.well-known/openid-configuration" > oauthMetadata
$ cat oauthMetadata
{"issuer":"https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"
$ oc create configmap oauth-meta --from-file ./oauthMetadata -n clusters --kubeconfig /path/to/management-cluster/kubeconfig
...
3. Set hc.spec.configuration.authentication:
$ CLIENT_ID=openshift-test-aud
$ oc patch hc hypershift-ci-18556 -n clusters --kubeconfig /path/to/management-cluster/kubeconfig --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: oauth-meta
      oidcProviders:
      - claimMappings:
          ...
        issuer:
          audiences:
          - $CLIENT_ID
          issuerCertificateAuthority:
            name: keycloak-oidc-ca
          issuerURL: $KEYCLOAK_HOST/realms/master
        name: keycloak-oidc-test
      type: OIDC
"
Check KAS indeed already picks up the setting:
$ oc logs -c kube-apiserver kube-apiserver-5c976d59f5-zbrwh -n clusters-hypershift-ci-18556 --kubeconfig /path/to/management-cluster/kubeconfig | grep "oidc-"
...
I0130 08:07:24.266247 1 flags.go:64] FLAG: --oidc-ca-file="/etc/kubernetes/certs/oidc-ca/ca.crt"
I0130 08:07:24.266251 1 flags.go:64] FLAG: --oidc-client-id="openshift-test-aud"
...
I0130 08:07:24.266261 1 flags.go:64] FLAG: --oidc-issuer-url="https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"
...
Wait about 15 mins.
4. Check COs and check oc login. Both show the same error:
$ oc get co | grep -v 'True.*False.*False'
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
console   4.16.0-0.nightly-2024-01-29-233218   True        True          False      4h57m   SyncLoopRefreshProgressing: Working toward version 4.16.0-0.nightly-2024-01-29-233218, 1 replicas available
$ oc get po -n openshift-console
NAME                       READY   STATUS             RESTARTS         AGE
console-547cf6bdbb-l8z9q   1/1     Running            0                4h55m
console-54f88749d7-cv7ht   0/1     CrashLoopBackOff   9 (3m18s ago)    14m
console-54f88749d7-t7x96   0/1     CrashLoopBackOff   9 (3m32s ago)    14m
$ oc logs console-547cf6bdbb-l8z9q -n openshift-console
I0130 03:23:36.788951 1 metrics.go:156] usage.Metrics: Update console users metrics: 0 kubeadmin, 0 cluster-admins, 0 developers, 0 unknown/errors (took 406.059196ms)
E0130 06:48:32.745179 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
E0130 06:53:32.757881 1 asynccache.go:43] failed a caching attempt: request to OAuth issuer endpoint https://:0/oauth/token failed: Head "https://:0": dial tcp :0: connect: connection refused
...
$ oc login --exec-plugin=oc-oidc --client-id=openshift-test-aud --extra-scopes=email,profile --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
5. Check root cause, the configured oauthMetadata is not picked up well:
$ curl -k https://a6e149f24f8xxxxxx.elb.ap-east-1.amazonaws.com:6443/.well-known/oauth-authorization-server
{
  "issuer": "https://:0",
  "authorization_endpoint": "https://:0/oauth/authorize",
  "token_endpoint": "https://:0/oauth/token",
  ...
}
Actual results:
As shown in steps 4 and 5 above, the configured oauthMetadata is not picked up, causing the console and oc login to hit the error.
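The symptom from step 5, an issuer of "https://:0", can be detected with a small sanity check on the served metadata. This is a sketch only: `issuer_is_usable` is a hypothetical helper, and the sample documents merely mirror the values shown in the steps above.

```python
from urllib.parse import urlsplit


def issuer_is_usable(issuer: str) -> bool:
    """Return False for issuers with no host or with port 0, such as https://:0."""
    parts = urlsplit(issuer)
    if parts.scheme != "https" or not parts.hostname:
        return False
    return parts.port != 0


# Metadata as served by the hosted cluster in step 5 (abridged).
served = {"issuer": "https://:0", "token_endpoint": "https://:0/oauth/token"}
# Metadata as configured from Keycloak in step 2 (abridged).
configured = {"issuer": "https://keycloak-keycloak.apps.hypershift-ci-18556.xxx/realms/master"}

print(issuer_is_usable(served["issuer"]))
print(issuer_is_usable(configured["issuer"]))
```

The first check fails and the second passes, which matches the observation that the configured document never reaches the well-known endpoint.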
Expected results:
The configured oauthMetadata is honored, and neither the console nor oc login reports an error.
Additional info:
For oc, if I manually use `oc config set-credentials oidc --exec-api-version=client.authentication.k8s.io/v1 --exec-command=oc --exec-arg=get-token --exec-arg="--issuer-url=$KEYCLOAK_HOST/realms/master" ...` instead of using `oc login --exec-plugin=oc-oidc ...`, oc authentication works well. This means my configuration is correct.
$ oc whoami
Please visit the following URL in your browser: http://localhost:8080
oidc-user-test:xxia@redhat.com
This is a clone of issue OCPBUGS-26069. The following is the description of the original issue:
—
Component Readiness has found a potential regression in [sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel].
Probability of significant regression: 98.46%
Sample (being evaluated) Release: 4.15
Start Time: 2023-12-29T00:00:00Z
End Time: 2024-01-04T23:59:59Z
Success Rate: 83.33%
Successes: 15
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 98.36%
Successes: 120
Failures: 2
Flakes: 0
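The figures above can be reproduced with a short calculation. The sketch assumes the regression probability is one minus the one-sided Fisher's exact p-value on the two pass/fail tables; that reading is an assumption, and the function name `fisher_one_sided` is hypothetical.

```python
from math import comb


def fisher_one_sided(sample_fail, sample_total, base_fail, base_total):
    """P-value for seeing at least sample_fail failures in the sample,
    given the combined failure count (hypergeometric upper tail)."""
    total = sample_total + base_total
    fails = sample_fail + base_fail
    denominator = comb(total, sample_total)
    max_fail = min(fails, sample_total)
    numerator = sum(
        comb(fails, k) * comb(total - fails, sample_total - k)
        for k in range(sample_fail, max_fail + 1)
    )
    return numerator / denominator


# Sample (4.15): 15 successes, 3 failures. Base (4.14): 120 successes, 2 failures.
sample_rate = 15 / (15 + 3)
base_rate = 120 / (120 + 2)
p_value = fisher_one_sided(sample_fail=3, sample_total=18, base_fail=2, base_total=122)

print(f"sample success rate: {sample_rate:.2%}")    # 83.33%
print(f"base success rate:   {base_rate:.2%}")      # 98.36%
print(f"regression probability: {1 - p_value:.2%}")  # 98.46%
```

With only 18 sample runs, three failures are enough to push the probability above 98%, so a handful of additional runs could move the figure noticeably.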
Description of problem:
In the Developer Sandbox, the happy path to create operator-backed resources is broken. Users can only work in their assigned namespace. When a user attempts to create an Operator-backed resource from the Developer console, the user interface inadvertently switches the working namespace from the user's namespace to the `openshift` one. The console shows an error message when the user clicks the "Create" button.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Log in to the Developer Sandbox
2. Choose the Developer view
3. Click Add+ -> Developer Catalog -> Operator Backed
4. Filter by "integration"
5. Notice the working namespace is still the user's one.
6. Select "Integration" (Camel K operator)
7. Click "Create"
8. Notice the working namespace has switched to `openshift`
9. Notice the custom resource in YAML view includes `namespace: openshift`
10. Click "Create"
Actual results:
An error message shows: "Danger alert: An error occurred. integrations.camel.apache.org is forbidden: User "bmesegue" cannot create resource "integrations" in API group "camel.apache.org" in the namespace "openshift""
Expected results:
On step 8, the working namespace should remain the user's one.
On step 9, in the YAML view, the namespace should be the user's one, or none.
After step 10, the creation process should trigger the creation of a Camel K integration.
Additional info:
Description of problem:
Deploying a cluster results in:
time="2023-10-30T19:10:59-04:00" level=debug msg="Apply complete! Resources: 0 added, 0 changed, 3 destroyed."
time="2023-10-30T19:10:59-04:00" level=fatal msg="error destroying bootstrap resources failed disabling bootstrap load balancing: %!w(<nil>)"
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Occasionally
Steps to Reproduce:
1. Deploy a PowerVS cluster in a zone with PER
Actual results:
Expected results:
It should deploy correctly
Additional info:
This is a clone of issue OCPBUGS-27094. The following is the description of the original issue:
—
Description of problem:
Based on component readiness data that compares success rates for two particular tests, we are regressing ~7-10% between the current 4.15 master and 4.14.z (in other words, we made the product ~10% worse).
These jobs and their failures are all caused by increased etcd leader elections disrupting seemingly unrelated test cases across the vSphere AMD64 platform.
Since this particular platform's business significance is high, I'm setting this as "Critical" severity.
Please get in touch with me or Dean West if more teams need to be pulled into investigation and mitigation.
Version-Release number of selected component (if applicable):
4.15 / master
How reproducible:
Component Readiness Board
Actual results:
The etcd leader elections are elevated. Some jobs indicate it is due to disk i/o throughput OR network overload.
Expected results:
1. We NEED to understand what is causing this problem.
2. If we can mitigate this, we should.
3. If we cannot mitigate this, we need to document this or work with the vSphere infrastructure provider to fix this problem.
4. We optionally need a way to measure how often this happens in our fleet so we can evaluate how bad it is.
Additional info:
Description of problem:
In HyperShift 4.14, the konnectivity server is run inside the kube-apiserver pod. When this pod is deleted for any reason, the konnectivity server container can drop before the rest of the pod terminates, which can cause network connections to drop. The following preStop definition can be added to the container to ensure it stays alive long enough for the rest of the pod to clean up.
lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - sleep 70
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Webpack DevServer hot reload is not working due to the recent update to Node.js v18.
Description of problem:
When bootstrap logs are collected (e.g. as part of a CI run when bootstrapping fails), they no longer contain logs from most of the Ironic services. These services used to run in standalone pods, but after a recent refactoring, they are systemd services.
Description of problem:
OKD/FCOS uses FCOS for its boot image, which lacks several tools and services, such as oc and crio, that the rendezvous host of the Agent-based Installer needs to set up a bootstrap control plane.
Version-Release number of selected component (if applicable):
4.13.0, 4.14.0, 4.15.0
[sig-arch][Late] operators should not create watch channels very often [apigroup:apiserver.openshift.io] [Suite:openshift/conformance/parallel]
{ fail [github.com/openshift/origin/test/extended/apiserver/api_requests.go:360]: Expected <[]string | len:1, cap:1>: [ "Operator \"cluster-node-tuning-operator\" produces more watch requests than expected: watchrequestcount=209, upperbound=184, ratio=1.14", ] to be empty Ginkgo exit error 1: exit with code 1}
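The numbers in the failure message can be checked by hand: the ratio is the observed watch request count divided by the upper bound. The sketch below paraphrases that comparison; it is not the origin test itself, and `check_watch_requests` is a hypothetical helper.

```python
def check_watch_requests(operator, count, upper_bound):
    """Return a failure message when count exceeds upper_bound, else None."""
    if count <= upper_bound:
        return None
    ratio = count / upper_bound
    return (
        f'Operator "{operator}" produces more watch requests than expected: '
        f"watchrequestcount={count}, upperbound={upper_bound}, ratio={ratio:.2f}"
    )


# Values taken from the failure above: 209 / 184 is roughly 1.14.
message = check_watch_requests("cluster-node-tuning-operator", 209, 184)
print(message)
```

An operator at or below its upper bound produces no message, which corresponds to the empty list that the test expects.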
Description of problem:
The Go docs for the install-config's platform.aws.lbType are misleading, as are those on the ingress object (oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type). Both say: "When this field is specified, the default ingresscontroller will be created using the specified load-balancer type." That is true, but what is missing is that ALL ingresscontrollers will be created using the specified load-balancer type by default (not just the default ingresscontroller). This missing information can be confusing to users.
Version-Release number of selected component (if applicable):
4.12+
How reproducible:
100%
Steps to Reproduce:
openshift-install explain installconfig.platform.aws.lbType
- or -
oc explain ingresses.config.openshift.io.spec.loadBalancer.platform.aws.type
Actual results:
./openshift-install explain installconfig.platform.aws.lbType
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>
  LBType is an optional field to specify a load balancer type. When this field is specified, the default ingresscontroller will be created using the specified load-balancer type. ...
[same with ingress.spec.loadBalancer.platform.aws.type]
Expected results:
My suggestion:
./openshift-install explain installconfig.platform.aws.lbType
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <string>
  LBType is an optional field to specify a load balancer type. When this field is specified, all ingresscontrollers (including the default ingresscontroller) will be created using the specified load-balancer type by default. ...
[same with ingress.spec.loadBalancer.platform.aws.type]
Additional info:
Since the change should be the same thing for both the installconfig and ingress object, this bug would handle both.
Multus doesn't need to watch pods on other nodes. To save memory and CPU, set MULTUS_NODE_NAME to filter the pods that Multus watches.
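The effect of node-scoped watching can be sketched as follows. The example assumes the node name from MULTUS_NODE_NAME is turned into a `spec.nodeName` field selector; how Multus actually applies the variable is not described here, and both helper functions and the sample pods are hypothetical.

```python
import os


def pod_field_selector(env=None):
    """Build a field selector limiting pods to the node in MULTUS_NODE_NAME."""
    env = os.environ if env is None else env
    node = env.get("MULTUS_NODE_NAME", "")
    # An empty selector means "watch pods on every node".
    return f"spec.nodeName={node}" if node else ""


def pods_on_node(pods, node):
    """Keep only pods scheduled to the given node (client-side equivalent)."""
    return [p for p in pods if p.get("spec", {}).get("nodeName") == node]


pods = [
    {"metadata": {"name": "a"}, "spec": {"nodeName": "worker-0"}},
    {"metadata": {"name": "b"}, "spec": {"nodeName": "worker-1"}},
]

print(pod_field_selector({"MULTUS_NODE_NAME": "worker-0"}))
print([p["metadata"]["name"] for p in pods_on_node(pods, "worker-0")])
```

Filtering on the server side with a selector is what saves memory and CPU, because pods on other nodes are never delivered to the watcher in the first place.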
Please review the following PR: https://github.com/openshift/image-registry/pull/387
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/131
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/72
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
OCP Upgrades fail with message "Upgrade error from 4.13.X: Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
Version-Release number of selected component (if applicable):
Currently 4.14.0-rc.1, but we observed the same issue with previous 4.14 nightlies too:
4.14.0-0.nightly-2023-09-12-195514
4.14.0-0.nightly-2023-09-02-132842
4.14.0-0.nightly-2023-08-28-154013
How reproducible:
1 out of 2 upgrades
Steps to Reproduce:
1. Deploy OCP 4.13 with latest GA on a baremetal cluster with IPI and OVN-K
2. Upgrade to the latest 4.14 available
3. Check cluster version status during the upgrade. At some point the upgrade stops with the message: "Upgrade error from 4.13.X Unable to apply 4.14.0-X: an unknown error has occurred: MultipleErrors"
4. Check OVN pods with "oc get pods -n openshift-ovn-kubernetes". There are pods running 7 out of 8 containers (missing ovnkube-node) that are constantly restarting, and pods running only 5 containers that show errors connecting to the OVN DBs.
5. Check cluster operators with "oc get co". Mainly dns, network, and machine-config remained in 4.13 and degraded.
Actual results:
Upgrade not completed, and OVN pods remain in a restarting loop with failures.
Expected results:
Upgrade should be completed without issues, and OVN pods should remain in a Running status without restarts.
Additional info:
These are the results from our latest test from 4.13.13 to 4.14.0-rc.1:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          2h8m    Unable to apply 4.14.0-rc.1: an unknown error has occurred: MultipleErrors
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-ebb1da47ad5cb76c396983decb7df1ea   True      False      False      3              3                   3                     0                      3h41m
worker   rendered-worker-26ccb35941236935a570dddaa0b699db   False     True       True       3              2                   2                     1                      3h41m
$ oc get co
NAME                                       VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.14.0-rc.1   True        False         False      2h21m
baremetal                                  4.14.0-rc.1   True        False         False      3h38m
cloud-controller-manager                   4.14.0-rc.1   True        False         False      3h41m
cloud-credential                           4.14.0-rc.1   True        False         False      2h23m
cluster-autoscaler                         4.14.0-rc.1   True        False         False      2h21m
config-operator                            4.14.0-rc.1   True        False         False      3h40m
console                                    4.14.0-rc.1   True        False         False      2h20m
control-plane-machine-set                  4.14.0-rc.1   True        False         False      3h40m
csi-snapshot-controller                    4.14.0-rc.1   True        False         False      2h21m
dns                                        4.13.13       True        True          True       2h9m
etcd                                       4.14.0-rc.1   True        False         False      2h40m
image-registry                             4.14.0-rc.1   True        False         False      2h9m
ingress                                    4.14.0-rc.1   True        True          True       1h14m
insights                                   4.14.0-rc.1   True        False         False      3h34m
kube-apiserver                             4.14.0-rc.1   True        False         False      2h35m
kube-controller-manager                    4.14.0-rc.1   True        False         False      2h30m
kube-scheduler                             4.14.0-rc.1   True        False         False      2h29m
kube-storage-version-migrator              4.14.0-rc.1   False       True          False      2h9m
machine-api                                4.14.0-rc.1   True        False         False      2h24m
machine-approver                           4.14.0-rc.1   True        False         False      3h40m
machine-config                             4.13.13       True        False         True       59m
marketplace                                4.14.0-rc.1   True        False         False      3h40m
monitoring                                 4.14.0-rc.1   False       True          True       2h3m
network                                    4.13.13       True        True          True       2h4m
node-tuning                                4.14.0-rc.1   True        False         False      2h9m
openshift-apiserver                        4.14.0-rc.1   True        False         False      2h20m
openshift-controller-manager               4.14.0-rc.1   True        False         False      2h20m
openshift-samples                          4.14.0-rc.1   True        False         False      2h23m
operator-lifecycle-manager                 4.14.0-rc.1   True        False         False      2h23m
operator-lifecycle-manager-catalog         4.14.0-rc.1   True        False         False      2h18m
operator-lifecycle-manager-packageserver   4.14.0-rc.1   True        False         False      2h20m
service-ca                                 4.14.0-rc.1   True        False         False      2h23m
storage                                    4.14.0-rc.1   True        False         False      3h40m
Some OVN pods are running 7 out of 8 containers (missing ovnkube-node) and are constantly restarting, while other pods run only 5 containers and show errors connecting to the OVN DBs.
$ oc get pods -n openshift-ovn-kubernetes -o wide
NAME                                     READY   STATUS    RESTARTS   AGE     IP              NODE
ovnkube-control-plane-5f5c598768-czkjv   2/2     Running   0          2h16m   192.168.16.32   dciokd-master-1
ovnkube-control-plane-5f5c598768-kg69r   2/2     Running   0          2h16m   192.168.16.31   dciokd-master-0
ovnkube-control-plane-5f5c598768-prfb5   2/2     Running   0          2h16m   192.168.16.33   dciokd-master-2
ovnkube-node-9hjv9                       5/5     Running   1          3h43m   192.168.16.32   dciokd-master-1
ovnkube-node-fmswc                       7/8     Running   19         2h10m   192.168.16.36   dciokd-worker-2
ovnkube-node-pcjhp                       7/8     Running   20         2h15m   192.168.16.35   dciokd-worker-1
ovnkube-node-q7kcj                       5/5     Running   1          3h43m   192.168.16.33   dciokd-master-2
ovnkube-node-qsngm                       5/5     Running   3          3h27m   192.168.16.34   dciokd-worker-0
ovnkube-node-v2d4h                       7/8     Running   20         2h15m   192.168.16.31   dciokd-master-0
$ oc logs ovnkube-node-9hjv9 -c ovnkube-node -n openshift-ovn-kubernetes | less
...
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112660 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Northbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.112699529Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.112699529Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1)
2023-09-19T03:40:23.112699529Z E0919 03:40:23.112677 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 cluster/status OVN_Northbound' failed: exit status 1
2023-09-19T03:40:23.114791313Z E0919 03:40:23.114777 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_NORTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnnb_db.ctl
2023-09-19T03:40:23.114791313Z ovn-appctl: cannot connect to "/var/run/ovn/ovnnb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.114791313Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnnb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116478 5883 ovn_db.go:511] Failed to retrieve cluster/status info for database "OVN_Southbound", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.116492808Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.116492808Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1)
2023-09-19T03:40:23.116492808Z E0919 03:40:23.116488 5883 ovn_db.go:590] OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 cluster/status OVN_Southbound' failed: exit status 1
2023-09-19T03:40:23.118468064Z E0919 03:40:23.118450 5883 ovn_db.go:283] Failed retrieving memory/show output for "OVN_SOUTHBOUND", stderr: 2023-09-19T03:40:23Z|00001|unixctl|WARN|failed to connect to /var/run/ovn/ovnsb_db.ctl
2023-09-19T03:40:23.118468064Z ovn-appctl: cannot connect to "/var/run/ovn/ovnsb_db.ctl" (No such file or directory)
2023-09-19T03:40:23.118468064Z , err: (OVN command '/usr/bin/ovn-appctl -t /var/run/ovn/ovnsb_db.ctl --timeout=5 memory/show' failed: exit status 1)
2023-09-19T03:40:25.118085671Z E0919 03:40:25.118056 5883 ovn_northd.go:128] Failed to get ovn-northd status stderr() :(failed to run the command since failed to get ovn-northd's pid: open /var/run/ovn/ovn-northd.pid: no such file or directory)
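Pods that are missing a container, like the 7/8 cases above, can be picked out of the `oc get pods` output by comparing the two halves of the READY column. The helper below is a hypothetical sketch that works on whitespace-separated text lines; the sample lines are abridged from the output above.

```python
def not_fully_ready(lines):
    """Return names of pods whose READY column shows fewer ready containers than total."""
    names = []
    for line in lines:
        fields = line.split()
        # Skip the header row and anything that does not look like "N/M".
        if len(fields) < 2 or "/" not in fields[1]:
            continue
        ready, total = (int(part) for part in fields[1].split("/"))
        if ready < total:
            names.append(fields[0])
    return names


sample = [
    "NAME READY STATUS RESTARTS AGE",
    "ovnkube-node-9hjv9 5/5 Running 1 3h43m",
    "ovnkube-node-fmswc 7/8 Running 19 2h10m",
    "ovnkube-node-pcjhp 7/8 Running 20 2h15m",
]

print(not_fully_ready(sample))
```

Note that the 5/5 pods pass this check even though they still run the older container layout, so the container total itself is also worth comparing across nodes.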
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/111
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
reason/DisruptionBegan request-audit-id/91e612b4-dd19-4783-ad62-46c55bbdaee4 backend-disruption-name/oauth-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: error running request: 500 Internal Server Error: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: leader changed","code":500}
Feels like there's something here we could dig into.
Most common on azure.
May show up in search.ci as well to help find the jobs more easily?
Please review the following PR: https://github.com/openshift/egress-router-cni/pull/76
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When we apply a machine config that only adds SSH key information, the only node action required is an uncordon. While the uncordon is happening, the condition reported is Cordoned = True, which confuses users. We could refine this design to report the cordon and uncordon status separately.
lastTransitionTime: '2023-11-28T16:53:58Z'
message: 'Action during previous iteration: (Un)Cordoned node. The node is reporting Unschedulable = false'
reason: UpdateCompleteCordoned
status: 'False'
type: Cordoned
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/34
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
vSphere Dual-stack install fails in bootstrap.
All nodes are node.cloudprovider.kubernetes.io/uninitialized
cloud-controller-manager can't find the nodes?
I0906 15:05:22.922183 1 search.go:49] WhichVCandDCByNodeID called but nodeID is empty
E0906 15:05:22.922187 1 nodemanager.go:197] shakeOutNodeIDLookup failed. Err=nodeID is empty
Version-Release number of selected component (if applicable):
4.14.0-0.ci.test-2023-09-06-141839-ci-ln-98f4iqb-latest
How reproducible:
Always
Steps to Reproduce:
1. Install vSphere IPI with OVN Dual-stack
platform:
  vsphere:
    apiVIPs:
    - 192.168.134.3
    - fd65:a1a8:60ad:271c::200
    ingressVIPs:
    - 192.168.134.4
    - fd65:a1a8:60ad:271c::201
networking:
  networkType: OVNKubernetes
  machineNetwork:
  - cidr: 192.168.0.0/16
  - cidr: fd65:a1a8:60ad:271c::/64
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  - cidr: fd65:10:128::/56
    hostPrefix: 64
  serviceNetwork:
  - 172.30.0.0/16
  - fd65:172:16::/112
Actual results:
Install fails in bootstrap
Expected results:
Install succeeds
Additional info:
I0906 15:03:21.393629 1 search.go:69] WhichVCandDCByNodeID by UUID
I0906 15:03:21.393632 1 search.go:76] WhichVCandDCByNodeID nodeID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406797 1 search.go:208] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406816 1 search.go:210] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2, UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.406830 1 nodemanager.go:159] Discovered VM using normal UUID format
I0906 15:03:21.416168 1 nodemanager.go:268] Adding Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2
I0906 15:03:21.416218 1 nodemanager.go:438] Adding Internal IP: 192.168.134.60
I0906 15:03:21.416229 1 nodemanager.go:443] Adding External IP: 192.168.134.60
I0906 15:03:21.416244 1 nodemanager.go:349] Found node 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416266 1 nodemanager.go:351] Hostname: ci-ln-bllxr6t-c1627-5p7mq-master-2 UUID: 421b78c3-f8bb-970c-781b-76827306e89e
I0906 15:03:21.416278 1 instances.go:77] instances.NodeAddressesByProviderID() FOUND with 421b78c3-f8bb-970c-781b-76827306e89e
E0906 15:03:21.416326 1 node_controller.go:236] error syncing 'ci-ln-bllxr6t-c1627-5p7mq-master-2': failed to get node modifiers from cloud provider: provided node ip for node "ci-ln-bllxr6t-c1627-5p7mq-master-2" is not valid: failed to get node address from cloud provider that matches ip: fd65:a1a8:60ad:271c::70, requeuing
I0906 15:03:21.623573 1 instances.go:102] instances.InstanceID() CACHED with ci-ln-bllxr6t-c1627-5p7mq-master-1
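The error in the log above says that the node's configured IP (the IPv6 address fd65:a1a8:60ad:271c::70) does not match any address the cloud provider reported (only 192.168.134.60 was added). A minimal Python sketch of that kind of check follows, assuming the match is a plain membership test; the function name `node_ip_is_reported` is hypothetical and is not taken from the cloud provider code.

```python
import ipaddress


def node_ip_is_reported(node_ip: str, reported: list[str]) -> bool:
    """Return True if node_ip equals one of the provider-reported addresses.

    Addresses are parsed first so that different spellings of the same
    IPv6 address still compare equal.
    """
    wanted = ipaddress.ip_address(node_ip)
    return any(ipaddress.ip_address(addr) == wanted for addr in reported)


# Values taken from the log: only the IPv4 address was reported.
reported_addresses = ["192.168.134.60"]
print(node_ip_is_reported("192.168.134.60", reported_addresses))
print(node_ip_is_reported("fd65:a1a8:60ad:271c::70", reported_addresses))
```

With a dual-stack node, a check like this only passes for both families if the provider reports an IPv4 and an IPv6 address for the VM.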
Description of problem:
In the implementation of METAL-163, the support for the new Ironic Node field external_http_url was only added for floppy-based configuration images, not for CD images that we use in OpenShift. This makes external_http_url a no-op.
cluster-capi-operator is incorrectly updating the container command to /bin/cluster-api-provider-openstack-manager. It should leave it alone because it is already correct.
Description of problem:
The agent-based installer does not support the TechPreviewNoUpgrade featureSet, and by extension nor does it support any of the features gated by it. Because of this, there is no warning about one of these features being specified - we expect the TechPreviewNoUpgrade feature gate to error out when any of them are used.
However, we don't warn about TechPreviewNoUpgrade itself being ignored, so if the user does specify it then they can use some of these non-supported features without being warned that their configuration is ignored.
We should fail with an error when TechPreviewNoUpgrade is specified, until such time as AGENT-554 is implemented.
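The requested behaviour can be sketched as a small validation step. This is a hypothetical Python illustration, not the installer's code: the function name `validate_feature_set` and the error wording are assumptions, and only the featureSet value TechPreviewNoUpgrade comes from the description above.

```python
def validate_feature_set(feature_set: str) -> list[str]:
    """Return validation errors for the featureSet requested in install-config.

    Hypothetical rule for this sketch: reject TechPreviewNoUpgrade outright
    instead of silently ignoring it.
    """
    errors = []
    if feature_set == "TechPreviewNoUpgrade":
        errors.append(
            "featureSet TechPreviewNoUpgrade is not supported by the agent-based installer"
        )
    return errors


print(validate_feature_set(""))
print(validate_feature_set("TechPreviewNoUpgrade"))
```

Returning a list of errors rather than raising lets a caller report this problem alongside any other validation failures.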
Please review the following PR: https://github.com/openshift/route-override-cni/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/323
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After a manual crash of an OCP node, the OSPD VM running on that node is stuck in the Terminating state.
Version-Release number of selected component (if applicable):
OCP 4.12.15 osp-director-operator.v1.3.0 kubevirt-hyperconverged-operator.v4.12.5
How reproducible:
Log in to an OCP 4.12.15 node running a VM and manually crash the master node. After the reboot, the VM stays in the Terminating state.
Steps to Reproduce:
1. ssh core@masterX
2. sudo su
3. echo c > /proc/sysrq-trigger
Actual results:
After the reboot, the VM stays in the Terminating state.
$ omc get node|sed -e 's/modl4osp03ctl/model/g' | sed -e 's/telecom.tcnz.net/aaa.bbb.ccc/g'
NAME                  STATUS   ROLES                         AGE   VERSION
model01.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model02.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
model03.aaa.bbb.ccc   Ready    control-plane,master,worker   91d   v1.25.8+37a9a08
$ omc get pod -n openstack
NAME                                                        READY   STATUS         RESTARTS   AGE
openstack-provision-server-7b79fcc4bd-x8kkz                 2/2     Running        0          8h
openstackclient                                             1/1     Running        0          7h
osp-director-operator-controller-manager-5896b5766b-sc7vm   2/2     Running        0          8h
osp-director-operator-index-qxxvw                           1/1     Running        0          8h
virt-launcher-controller-0-9xpj7                            1/1     Running        0          20d
virt-launcher-controller-1-5hj9x                            1/1     Running        0          20d
virt-launcher-controller-2-vhd69                            0/1     NodeAffinity   0          43d
$ omc describe pod virt-launcher-controller-2-vhd69 |grep Status:
Status: Terminating (lasts 37h)
$ xsos sosreport-xxxx/|grep time
...
Boot time: Wed Nov 22 01:44:11 AM UTC 2023
Uptime: 8:27, 0 users
Expected results:
The VM restarts automatically, or at least does not stay in the Terminating state.
Additional info:
The issue has been seen twice. The first time, a kernel crash occurred and the associated VM on the node ended up in the Terminating state. The second time, we tried to reproduce the issue by manually crashing the kernel and got the same result: the VM running on the OCP node stays in the Terminating state.
When we try to create a cluster with --secret-creds, pointing at an MCE AWS k8s secret that includes the AWS credentials, pull secret, and base domain, the binary should not ask for a pull secret. However, it now does, after the change from hypershift.
Adding the pull secret parameter allows the command to continue as expected, although the whole point of --secret-creds is presumably to reuse what already exists.
/usr/local/bin/hcp create cluster aws --name acmqe-hc-ad5b1f645d93464c --secret-creds test1-cred --region us-east-1 --node-pool-replicas 1 --namespace local-cluster --instance-type m6a.xlarge --release-image quay.io/openshift-release-dev/ocp-release:4.14.0-ec.4-multi --generate-ssh
Output:
Error: required flag(s) "pull-secret" not set
required flag(s) "pull-secret" not set
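The expected behaviour is that the pull secret flag is only required when the referenced credentials secret cannot supply one. A hedged Python sketch of that precedence follows; the function name `resolve_pull_secret` and the secret key `pullSecret` are assumptions made for illustration, not the hcp implementation.

```python
def resolve_pull_secret(flag_value, secret_data):
    """Pick the pull secret: an explicit flag wins, else the credentials secret.

    secret_data is a dict of already-decoded secret fields; the key name
    "pullSecret" is an assumption made for this sketch.
    """
    if flag_value:
        return flag_value
    pull_secret = secret_data.get("pullSecret")
    if not pull_secret:
        # Only complain when neither source provides a value.
        raise ValueError('required flag(s) "pull-secret" not set')
    return pull_secret


print(resolve_pull_secret(None, {"pullSecret": "from-secret"}))
```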
2.4.0-DOWNANDBACK-2023-08-31-13-34-02 or mce 2.4.0-137
hcp version openshift/hypershift: 8b4b52925d47373f3fe4f0d5684c88dc8a93368a. Latest supported OCP: 4.14.0
always
Description of problem:
The problem was that namespace handler on initial sync would delete all ports (because logical port cache where it got lsp UUIDs wasn't populated) and all acls (they were just set to nil). Even though both ports and acls will be re-added by the corresponding handlers, it may cause disruption.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. create a namespace with at least 1 pod and egress firewall in it
2. pick any ovnkube-node pod, find namespace port group UUID in nbdb by external_ids["name"]=<namespace name>, e.g. for "test" namespace
_uuid        : 6142932d-4084-4bc3-bdcb-1990fc71891b
acls         : [ab2be619-1266-41c2-bb1d-1052cb4e1e97, b90a4b4a-ceee-41ee-a801-08c37a9bf3e7, d314fa8d-7b5a-40a5-b3d4-31091d7b9eae]
external_ids : {name=test}
name         : a18007334074686647077
ports        : [55b700e4-8176-42e7-97a6-8b32a82fefe5, cb71739c-ad6c-4436-8fd6-0643a5417c7d, d8644bf1-6bed-4db7-abf8-7aaab0625324]
3. restart chosen ovn-k pod
4. check the logs after the restart for an update that sets the chosen port group to zero ports and zero acls
Update operations generated as: [{Op:update Table:Port_Group Row:map[acls:{GoSet:[]} external_ids:{GoMap:map[name:test]} ports:{GoSet:[]}] Rows:[] Columns:[] Mutations:[] Timeout:<nil> Where:[where column _uuid == {6142932d-4084-4bc3-bdcb-1990fc71891b}] Until: Durable:<nil> Comment:<nil> Lock:<nil> UUID: UUIDName:}]
Actual results:
Expected results:
On restart port group stays the same, no extra update with empty ports and acls is generated
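The expected result amounts to "generate no update when nothing changed". A minimal Python sketch of that idea follows, assuming the existing and desired port group contents are available as sets of UUIDs; the function name `port_group_update` is hypothetical and this is not the ovn-kubernetes fix itself.

```python
def port_group_update(existing, desired):
    """Return only the columns that differ, or None when nothing changed.

    Both arguments map a column name ("ports", "acls") to a set of UUIDs.
    """
    changed = {
        col: sorted(uuids)
        for col, uuids in desired.items()
        if set(existing.get(col, ())) != set(uuids)
    }
    return changed or None


# Shortened placeholder UUIDs, for illustration only.
current = {"ports": {"55b700e4", "cb71739c"}, "acls": {"ab2be619"}}
print(port_group_update(current, current))
```

The important precondition, per the description, is that the desired sets are built from a populated cache; computing them from an empty cache is what produced the empty update.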
Additional info:
This is a clone of issue OCPBUGS-22399. The following is the description of the original issue:
—
Users are encountering an issue when attempting to "Create hostedcluster on BM+disconnected+ipv6 through MCE." The issue is related to `--enable-uwm-telemetry-remote-write` defaulting to true. In the default case on a disconnected cluster, whatever is configured in the UWM configmap (e.g. minBackoff: 1s, url: https://infogw.api.openshift.com/metrics/v1/receive) might not be reachable. We should look into reporting the issue and remediating it rather than failing fatally in disconnected scenarios.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In MCE 2.4, we currently document disabling `--enable-uwm-telemetry-remote-write` if the hosted control plane feature is used in a disconnected environment. https://github.com/stolostron/rhacm-docs/blob/lahinson-acm-7739-disconnected-bare-[…]s/hosted_control_planes/monitor_user_workload_disconnected.adoc Once this Jira is fixed, that documentation needs to be removed, because users will no longer need to disable `--enable-uwm-telemetry-remote-write`. The HO is expected to fail gracefully on `--enable-uwm-telemetry-remote-write` and continue to be operational.
Description of problem:
https://redhat-internal.slack.com/archives/C061SJRTKDG/p1697798046548799
In some OCM environments, the latest HO is stuck reconciling the CAPI provider for some 4.12 HCs:
{"level":"error","ts":"2023-10-20T10:53:27Z","msg":"Reconciler error","controller":"hostedcluster","controllerGroup":"hypershift.openshift.io","controllerKind":"HostedCluster","HostedCluster":{"name":"build08","namespace":"ocm-production-23qm3j1pkslelghufgs874g86ccn5sba"},"namespace":"ocm-production-23qm3j1pkslelghufgs874g86ccn5sba","name":"build08","reconcileID":"482f297f-8afb-407c-96d9-bc1de727ef78","error":"failed to reconcile capi provider: failed to reconcile capi provider deployment: Deployment.apps \"capi-provider\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app\":\"capi-provider-controller-manager\", \"control-plane\":\"capi-provider-controller-manager\", \"hypershift.openshift.io/control-plane-component\":\"capi-provider-controller-manager\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:326\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:273\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/opt/app-root/src/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:234"}
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
reconciliation is stuck
Expected results:
reconciliation succeeds
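The error in the log is the API server rejecting a change to spec.selector, which is immutable on an existing Deployment. One common way to get past that, sketched here in Python as an assumption rather than as the fix HyperShift adopted, is to compare selectors before updating and recreate the Deployment when they differ. The function name `reconcile_action` is hypothetical.

```python
def reconcile_action(existing_selector, desired_selector):
    """Decide how to reconcile a Deployment given its label selectors.

    existing_selector is None when the Deployment does not exist yet.
    """
    if existing_selector is None:
        return "create"
    if existing_selector != desired_selector:
        # spec.selector is immutable, so an in-place update would be rejected.
        return "recreate"
    return "update"


old = {"app": "capi-provider-controller-manager"}
new = dict(old, **{"control-plane": "capi-provider-controller-manager"})
print(reconcile_action(old, new))
```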
Additional info:
Description of problem:
Enable UWM and the UWM Alertmanager:
$ oc -n openshift-monitoring get cm cluster-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-17T06:02:36Z"
  name: cluster-monitoring-config
  namespace: openshift-monitoring
  resourceVersion: "259151"
  uid: a9365c21-5c1d-4c91-98ee-f074b023dd31
$ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-17T06:02:44Z"
  labels:
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/part-of: openshift-monitoring
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "148193"
  uid: b3c6e5a6-ff7b-4ae4-85eb-28be683119e4
$ oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-user-workload-0           6/6     Running   0          4h50m
alertmanager-user-workload-1           6/6     Running   0          4h50m
prometheus-operator-77bcdcbd9c-7nt6v   2/2     Running   0          6h14m
prometheus-user-workload-0             6/6     Running   0          6h14m
prometheus-user-workload-1             6/6     Running   0          6h14m
thanos-ruler-user-workload-0           4/4     Running   0          4h50m
thanos-ruler-user-workload-1           4/4     Running   0          4h50m
As the kubeadmin user, create a namespace and a PrometheusRule so that the alert fires:
apiVersion: v1
kind: Namespace
metadata:
  name: ns1
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alert
  namespace: ns1
spec:
  groups:
  - name: example
    rules:
    - alert: TestAlert
      expr: vector(1)
      labels:
        severity: none
      annotations:
        message: This is an alert meant to ensure that the entire alerting pipeline is functional.
The alerts are visible from the UWM Alertmanager:
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-user-workload.openshift-user-workload-monitoring.svc:9095/api/v2/alerts' | jq
[
  {
    "annotations": {
      "message": "This is an alert meant to ensure that the entire alerting pipeline is functional."
    },
    "endsAt": "2023-08-17T12:08:41.558Z",
    "fingerprint": "348490d73f8513a0",
    "receivers": [
      {
        "name": "Default"
      }
    ],
    "startsAt": "2023-08-17T12:04:11.558Z",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2023-08-17T12:04:41.583Z",
    "generatorURL": "https://thanos-querier-openshift-monitoring.apps.***/api/graph?g0.expr=vector%281%29&g0.tab=1",
    "labels": {
      "alertname": "TestAlert",
      "namespace": "ns1",
      "severity": "none"
    }
  }
]
Open another terminal, or have another person run the following commands in their terminal:
##### log in as a common user; the pod is deployed to the project only so that we can run curl from it
# oc login https://${api_server}:6443 -u ${user} -p ${password}
# oc new-project test
# oc -n test new-app rails-postgresql-example
# oc -n test get pod
NAME                                  READY   STATUS      RESTARTS   AGE
postgresql-1-deploy                   0/1     Completed   0          13m
postgresql-1-v4lz5                    1/1     Running     0          13m
rails-postgresql-example-1-build      0/1     Completed   0          13m
rails-postgresql-example-1-crdbq      1/1     Running     0          9m20s
rails-postgresql-example-1-deploy     0/1     Completed   0          9m42s
rails-postgresql-example-1-hook-pre   0/1     Completed   0          9m39s
# token=`oc whoami -t`
# echo $token
sha256~EJCVjflM6lbsl8plKkU7Hv0swkQMxySJr5BGXRJaKhU
The common user can see the alert from the UWM Alertmanager service:
# oc -n test exec postgresql-1-v4lz5 -- curl -k -H "Authorization: Bearer $token" 'https://alertmanager-user-workload.openshift-user-workload-monitoring.svc:9095/api/v2/alerts' | jq
[
  {
    "annotations": {
      "message": "This is an alert meant to ensure that the entire alerting pipeline is functional."
    },
    "endsAt": "2023-08-17T12:16:56.558Z",
    "fingerprint": "348490d73f8513a0",
    "receivers": [
      {
        "name": "Default"
      }
    ],
    "startsAt": "2023-08-17T12:04:11.558Z",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2023-08-17T12:12:56.563Z",
    "generatorURL": "https://thanos-querier-openshift-monitoring.apps.***/api/graph?g0.expr=vector%281%29&g0.tab=1",
    "labels": {
      "alertname": "TestAlert",
      "namespace": "ns1",
      "severity": "none"
    }
  }
]
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-16-114741
How reproducible:
always
Steps to Reproduce:
1. see the description
Actual results:
common user can view UWM alertmanager alerts
Expected results:
Additional info:
if this is expected, we could close the bug
How reproducible:
Always
Steps to Reproduce:
1. The Kubernetes API introduces a new Pod Template parameter (`ephemeral`).
2. This parameter is not in the allowed list of the default SCC.
3. Customers are not allowed to edit the default SCCs, nor do we have a mechanism in place to update the built-in SCCs, AFAIK.
4. Users of existing clusters cannot use the new parameter without creating manual SCCs and assigning them to service accounts themselves, which looks clunky. This is documented in https://access.redhat.com/articles/6967808
Actual results:
Users of existing clusters cannot use ephemeral volumes after an upgrade
Expected results:
Users of existing clusters *can* use ephemeral volumes after an upgrade
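The admission problem described in the steps above comes down to a set difference between the volume types a pod uses and the volume types an SCC allows. A minimal Python sketch follows; the function name `disallowed_volume_types` is hypothetical and the allowed list shown is an illustrative example, not the exact contents of any built-in SCC.

```python
def disallowed_volume_types(pod_volume_types, scc_allowed_volumes):
    """Return the pod volume types that the SCC's volumes list does not allow.

    A "*" entry in the SCC list is treated as allowing every volume type.
    """
    if "*" in scc_allowed_volumes:
        return []
    return sorted(set(pod_volume_types) - set(scc_allowed_volumes))


# Example allowed list for an SCC created before `ephemeral` existed (assumed values).
old_scc_volumes = ["configMap", "downwardAPI", "emptyDir",
                   "persistentVolumeClaim", "projected", "secret"]
print(disallowed_volume_types(["emptyDir", "ephemeral"], old_scc_volumes))
```

On an upgraded cluster, the expected result is reached once `ephemeral` appears in the allowed list, at which point the function returns an empty list for the same pod.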
Current status
Description of problem:
We should document how to preserve the kebab menu in the TableData component when building a list page for a dynamic plugin. Currently, {className: "pf-c-table__action", id: ""} needs to be set on the component for the column to be preserved, which is definitely not obvious to plugin creators. There is also an upstream issue that should address this, either by making the setting more obvious or at least by documenting it better. Either way, we should document the current state in our docs/code/examples.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The official OpenShift documentation does not cover this issue: https://docs.openshift.com/container-platform/4.14/installing/installing_openstack/installing-openstack-user.html
Only the upstream docs have it.
Issue 43 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
The active and hover states for the topology list view are incorrect
Screenshot: https://drive.google.com/file/d/1DMwmYsvdHXvMBYr0gOD9mActmJNMaH6z/view?usp=share_link
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4152
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing an IPI cluster from a 4.15 nightly build on Azure MAG, on Azure Stack Hub, or with Azure workload identity, the image-registry cluster operator is degraded with different errors.
On MAG:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       5h44m   AzurePathFixControllerDegraded: Migration failed: panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host...
$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-ssn5w                               0/1     Error     0               5h47m
cluster-image-registry-operator-86cdf775c7-7brn6   1/1     Running   1 (5h50m ago)   5h58m
image-registry-5c6796b86d-46lvx                    1/1     Running   0               5h47m
image-registry-5c6796b86d-9st5d                    1/1     Running   0               5h47m
node-ca-48lsh                                      1/1     Running   0               5h44m
node-ca-5rrsl                                      1/1     Running   0               5h47m
node-ca-8sc92                                      1/1     Running   0               5h47m
node-ca-h6trz                                      1/1     Running   0               5h47m
node-ca-hm7s2                                      1/1     Running   0               5h47m
node-ca-z7tv8                                      1/1     Running   0               5h44m
$ oc logs azure-path-fix-ssn5w -n openshift-image-registry
panic: Get "https://imageregistryjima41xvvww.blob.core.windows.net/jima415a-hfxfh-image-registry-vbibdmawmsvqckhvmmiwisebryohfbtm?comp=list&prefix=docker&restype=container": dial tcp: lookup imageregistryjima41xvvww.blob.core.windows.net on 172.30.0.10:53: no such host
goroutine 1 [running]:
main.main()
/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:49 +0x125
The blob storage endpoint seems incorrect; it should be:
$ az storage account show -n imageregistryjima41xvvww -g jima415a-hfxfh-rg --query primaryEndpoints
{
  "blob": "https://imageregistryjima41xvvww.blob.core.usgovcloudapi.net/",
  "dfs": "https://imageregistryjima41xvvww.dfs.core.usgovcloudapi.net/",
  "file": "https://imageregistryjima41xvvww.file.core.usgovcloudapi.net/",
  "internetEndpoints": null,
  "microsoftEndpoints": null,
  "queue": "https://imageregistryjima41xvvww.queue.core.usgovcloudapi.net/",
  "table": "https://imageregistryjima41xvvww.table.core.usgovcloudapi.net/",
  "web": "https://imageregistryjima41xvvww.z2.web.core.usgovcloudapi.net/"
}
On Azure Stack Hub:
$ oc get co image-registry
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.15.0-0.nightly-2024-02-16-235514   True        False         True       3h32m   AzurePathFixControllerDegraded: Migration failed: panic: open : no such file or directory...
$ oc get pod -n openshift-image-registry
NAME                                               READY   STATUS    RESTARTS        AGE
azure-path-fix-8jdg7                               0/1     Error     0               3h35m
cluster-image-registry-operator-86cdf775c7-jwnd4   1/1     Running   1 (3h38m ago)   3h54m
image-registry-658669fbb4-llv8z                    1/1     Running   0               3h35m
image-registry-658669fbb4-lmfr6                    1/1     Running   0               3h35m
node-ca-2jkjx                                      1/1     Running   0               3h35m
node-ca-dcg2v                                      1/1     Running   0               3h35m
node-ca-q6xmn                                      1/1     Running   0               3h35m
node-ca-r46r2                                      1/1     Running   0               3h35m
node-ca-s8jkb                                      1/1     Running   0               3h35m
node-ca-ww6ql                                      1/1     Running   0               3h35m
$ oc logs azure-path-fix-8jdg7 -n openshift-image-registry
panic: open : no such file or directory
goroutine 1 [running]:
main.main()
/go/src/github.com/openshift/cluster-image-registry-operator/cmd/move-blobs/main.go:36 +0x145
On a cluster with Azure workload identity:
The operator's PROGRESSING is True
image-registry   4.15.0-0.nightly-2024-02-16-235514   True   True   False   43m   Progressing: The deployment has not completed...
The azure-path-fix pod is in CreateContainerConfigError status and reports this error in its event:
"state": {
  "waiting": {
    "message": "couldn't find key REGISTRY_STORAGE_AZURE_ACCOUNTKEY in Secret openshift-image-registry/image-registry-private-configuration",
    "reason": "CreateContainerConfigError"
  }
}
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-16-235514
How reproducible:
Always
Steps to Reproduce:
1. Install an IPI cluster on MAG or Azure Stack Hub, or configure Azure workload identity.
Actual results:
Installation failed and image-registry operator is degraded
Expected results:
Installation is successful.
Additional info:
The issue seems to be related to https://github.com/openshift/image-registry/pull/393
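The MAG failure shows a public-cloud endpoint suffix (core.windows.net) being used where the government cloud suffix (core.usgovcloudapi.net) is required. A small Python sketch of deriving the blob endpoint from the cloud environment follows. The helper `blob_endpoint` and the mapping name are hypothetical; the suffixes are the well-known storage suffixes for these clouds, and Azure Stack Hub is deliberately left out because its suffix is specific to each deployment.

```python
# Storage endpoint suffix per Azure cloud environment name.
STORAGE_SUFFIX = {
    "AzurePublicCloud": "core.windows.net",
    "AzureUSGovernmentCloud": "core.usgovcloudapi.net",
    "AzureChinaCloud": "core.chinacloudapi.cn",
}


def blob_endpoint(account: str, cloud: str) -> str:
    """Build the blob service URL for a storage account in the given cloud."""
    try:
        suffix = STORAGE_SUFFIX[cloud]
    except KeyError:
        # Unknown clouds (for example Azure Stack Hub) need an explicit suffix.
        raise ValueError(f"no known storage suffix for cloud {cloud!r}")
    return f"https://{account}.blob.{suffix}/"


print(blob_endpoint("imageregistryjima41xvvww", "AzureUSGovernmentCloud"))
```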
Description of problem:
If nmstatectl is not present, print "install nmstate" in error message
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
FATAL * failed to validate network yaml for host 0, failed to execute 'nmstatectl gc', error: exec: "nmstatectl": executable file not found in $PATH
Expected results:
FATAL * failed to validate network yaml for host 0, install nmstate package, exec: "nmstatectl": executable file not found in $PATH
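The requested change is to detect the missing binary up front and say what to install. A minimal Python sketch of that check follows, using `shutil.which` from the standard library; the function name `nmstatectl_error` is hypothetical, the message text is taken from the expected result above, and the real installer is not written in Python.

```python
import shutil


def nmstatectl_error(which=shutil.which):
    """Return a hint if nmstatectl is not on PATH, else None.

    `which` is injectable so the lookup can be replaced in tests.
    """
    if which("nmstatectl") is None:
        return ('install nmstate package, exec: "nmstatectl": '
                "executable file not found in $PATH")
    return None


message = nmstatectl_error()
if message:
    print("FATAL * failed to validate network yaml for host 0, " + message)
```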
Additional info:
Please review the following PR: https://github.com/openshift/vertical-pod-autoscaler-operator/pull/147
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Important: ART has recorded in their product data that bugs for
this component should be opened against Jira project "OCPBUGS" and
component "Node / Autoscaler (HPA, VPA, CMA)". This project or component does not exist. Jira
should either be updated to include this component or @release-artists should be
notified of the proper mapping in the #forum-ocp-art Slack channel.
Component name: ose-vertical-pod-autoscaler-operator-container .
Jira mapping: https://github.com/openshift-eng/ocp-build-data/blob/main/product.yml
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/517
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-23306. The following is the description of the original issue:
—
Related with https://issues.redhat.com/browse/OCPBUGS-23000
By default, the cluster autoscaler evicts all pods, including those that come from daemon sets. In the case of EFS CSI drivers, whose volumes are mounted as NFS volumes, this causes stale NFS mounts, and application workloads are not terminated gracefully.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
- While scaling down a node from the cluster-autoscaler-operator, the DS pods are being evicted.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
CSI pods should not be evicted by the cluster autoscaler (at least not before the workloads are terminated), as eviction might cause data corruption
Additional info:
It is possible to disable eviction of the CSI pods by adding the following annotation to the CSI driver pod: cluster-autoscaler.kubernetes.io/enable-ds-eviction: "false"
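As an illustration of the workaround, the Python sketch below builds a JSON merge patch that sets the annotation on a DaemonSet's pod template. Placing it on the pod template (so that every pod the DaemonSet creates carries it) is an assumption of this sketch, and the function name `ds_eviction_patch` is hypothetical; only the annotation key and value come from the text above.

```python
import json

ANNOTATION = "cluster-autoscaler.kubernetes.io/enable-ds-eviction"


def ds_eviction_patch(enabled: bool) -> str:
    """Return a JSON merge patch that sets the eviction annotation on the pod template."""
    patch = {
        "spec": {
            "template": {
                "metadata": {
                    # Annotation values must be strings, hence "true"/"false".
                    "annotations": {ANNOTATION: "true" if enabled else "false"}
                }
            }
        }
    }
    return json.dumps(patch, sort_keys=True)


print(ds_eviction_patch(False))
```

The resulting string could be passed to a merge-patch call against the CSI driver's DaemonSet.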
This is a clone of issue OCPBUGS-28665. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
If the cluster is installed without the Build and DeploymentConfig capabilities, running `oc new-app registry.redhat.io/<namespace>/<image>:<tag>` creates a deployment whose spec.containers[0].image is blank. The deployment then fails to start a pod.
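The pod creation failure reported for this bug is a validation error on the image field ("must not have leading or trailing whitespace"). A Python sketch of an equivalent check follows; the function name `image_field_error` is hypothetical and the logic only mirrors the reported message, it is not the Kubernetes validation code.

```python
def image_field_error(image: str):
    """Return a validation message for a container image value, or None if it looks fine."""
    if image != image.strip():
        # A value of a single space, as generated here, fails this check.
        return "must not have leading or trailing whitespace"
    if image == "":
        return "must not be empty"
    return None


print(image_field_error(" "))
print(image_field_error("registry.redhat.io/ubi8/httpd-24:latest"))
```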
Version-Release number of selected component (if applicable):
oc version
Client Version: 4.14.0-0.nightly-2023-08-22-221456
Kustomize Version: v5.0.1
Server Version: 4.14.0-0.nightly-2023-09-02-132842
Kubernetes Version: v1.27.4+2c83a9f
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster without the build/deploymentconfig function by setting "baselineCapabilitySet: None" in install-config
2. Create a deployment using the 'new-app' command: oc new-app registry.redhat.io/ubi8/httpd-24:latest
3.
Actual results:
2. $ oc new-app registry.redhat.io/ubi8/httpd-24:latest
--> Found container image c412709 (11 days old) from registry.redhat.io for "registry.redhat.io/ubi8/httpd-24:latest"

    Apache httpd 2.4
    ----------------
    Apache httpd 2.4 available as container, is a powerful, efficient, and extensible web server. Apache supports a variety of features, many implemented as compiled modules which extend the core functionality. These can range from server-side programming language support to authentication schemes. Virtual hosting allows one Apache installation to serve many different Web sites.

    Tags: builder, httpd, httpd-24

    * An image stream tag will be created as "httpd-24:latest" that will track this image

--> Creating resources ...
    imagestream.image.openshift.io "httpd-24" created
    deployment.apps "httpd-24" created
    service "httpd-24" created
--> Success
    Application is not exposed. You can expose services to the outside world by executing one or more of the commands below:
     'oc expose service/httpd-24'
    Run 'oc status' to view your app

3. $ oc get deploy -o yaml
apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"httpd-24:latest"},"fieldPath":"spec.template.spec.containers[?(@.name==\"httpd-24\")].image"}]'
      openshift.io/generated-by: OpenShiftNewApp
    creationTimestamp: "2023-09-04T07:44:01Z"
    generation: 1
    labels:
      app: httpd-24
      app.kubernetes.io/component: httpd-24
      app.kubernetes.io/instance: httpd-24
    name: httpd-24
    namespace: wxg
    resourceVersion: "115441"
    uid: 909d0c4e-180c-4f88-8fb5-93c927839903
  spec:
    progressDeadlineSeconds: 600
    replicas: 1
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        deployment: httpd-24
    strategy:
      rollingUpdate:
        maxSurge: 25%
        maxUnavailable: 25%
      type: RollingUpdate
    template:
      metadata:
        annotations:
          openshift.io/generated-by: OpenShiftNewApp
        creationTimestamp: null
        labels:
          deployment: httpd-24
      spec:
        containers:
        - image: ' '
          imagePullPolicy: IfNotPresent
          name: httpd-24
          ports:
          - containerPort: 8080
            protocol: TCP
          - containerPort: 8443
            protocol: TCP
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        terminationGracePeriodSeconds: 30
  status:
    conditions:
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Created new replica set "httpd-24-7f6b55cc85"
      reason: NewReplicaSetCreated
      status: "True"
      type: Progressing
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: Deployment does not have minimum availability.
      reason: MinimumReplicasUnavailable
      status: "False"
      type: Available
    - lastTransitionTime: "2023-09-04T07:44:01Z"
      lastUpdateTime: "2023-09-04T07:44:01Z"
      message: 'Pod "httpd-24-7f6b55cc85-pvvgt" is invalid: spec.containers[0].image: Invalid value: " ": must not have leading or trailing whitespace'
      reason: FailedCreate
      status: "True"
      type: ReplicaFailure
    observedGeneration: 1
    unavailableReplicas: 1
kind: List
metadata:
Expected results:
Should set spec.containers[0].image to registry.redhat.io/ubi8/httpd-24:latest
Additional info:
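As an illustration only (this is not the code used by `oc`), the failing condition can be detected on a parsed Deployment manifest with a few lines of Python. The function name `find_blank_images` and the trimmed-down manifests are hypothetical; the check mirrors the API server message quoted above ("must not have leading or trailing whitespace").

```python
from __future__ import annotations


def find_blank_images(deployment: dict) -> list[str]:
    """Return the names of containers whose image is missing, empty, or padded with whitespace."""
    containers = (
        deployment.get("spec", {})
        .get("template", {})
        .get("spec", {})
        .get("containers", [])
    )
    bad = []
    for container in containers:
        image = container.get("image")
        # A single space is truthy, so also compare against the stripped value.
        if not image or image != image.strip():
            bad.append(container.get("name", "<unnamed>"))
    return bad


# Trimmed-down copy of the Deployment shown in the actual results.
reported = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "httpd-24", "image": " "},
    ]}}},
}

# The same Deployment with the image from the expected results.
expected = {
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [
        {"name": "httpd-24", "image": "registry.redhat.io/ubi8/httpd-24:latest"},
    ]}}},
}

print(find_blank_images(reported))
print(find_blank_images(expected))
```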
Description of problem:
There is no instance type validation under defaultMachinePlatform. For example, set platform.azure.defaultMachinePlatform.type to Standard_D11_v2, which does not support PremiumIO, then create manifests:

# az vm list-skus --location southcentralus --size Standard_D11_v2 --query "[].capabilities[?name=='PremiumIO'].value" -otsv
False

install-config.yaml:
-------------------
platform:
  azure:
    defaultMachinePlatform:
      type: Standard_D11_v2
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: southcentralus

The manifests are created successfully:
$ ./openshift-install create manifests --dir ipi
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json"
INFO Consuming Install Config from target directory
INFO Manifests created in: ipi/manifests and ipi/openshift

whereas setting the same type under compute produces the expected error:
$ ./openshift-install create manifests --dir ipi
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json"
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[0].platform.azure.osDisk.diskType: Invalid value: "Premium_LRS": PremiumIO not supported for instance type Standard_D11_v2

The same applies to the vmNetworkingType field under defaultMachinePlatform. Instance type Standard_B4ms does not support Accelerated networking:

# az vm list-skus --location southcentralus --size Standard_B4ms --query "[].capabilities[?name=='AcceleratedNetworkingEnabled'].value" -otsv
False

install-config.yaml
----------------
platform:
  azure:
    defaultMachinePlatform:
      type: Standard_B4ms
      vmNetworkingType: "Accelerated"

The installer still creates the manifests successfully. It should exit with the same error that is returned when type and vmNetworkingType are set under compute:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: compute[0].platform.azure.vmNetworkingType: Invalid value: "Accelerated": vm networking type is not supported for instance type Standard_B4ms
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-08-220853
How reproducible:
Always, on all supported versions
Steps to Reproduce:
1. Configure an invalid instance type (e.g. one that does not support PremiumIO) under defaultMachinePlatform in install-config.yaml.
2. Create manifests.
Actual results:
installer creates manifests successfully.
Expected results:
The installer should exit with an error, matching its behavior when an invalid instance type is configured under compute or controlPlane.
Additional info:
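The expected behavior can be sketched as one validation routine applied to every place an Azure machine pool may be declared, including defaultMachinePlatform. This is a simplified illustration, not the installer's Go implementation: `SKU_CAPABILITIES`, `validate_pool`, and `validate_install_config` are hypothetical names, the capability table is hard-coded from the `az vm list-skus` output quoted above, and treating Premium_LRS as the default disk type is an assumption inferred from the error message.

```python
# Hypothetical capability table; a real check would query the Azure SKU API.
# The two False values mirror the `az vm list-skus` output quoted in the report.
SKU_CAPABILITIES = {
    "Standard_D11_v2": {"PremiumIO": False},
    "Standard_B4ms": {"AcceleratedNetworkingEnabled": False},
}


def validate_pool(path, pool):
    """Return error strings for one Azure machine pool declared at `path`."""
    errors = []
    vm_type = pool.get("type")
    if vm_type is None:
        return errors
    caps = SKU_CAPABILITIES.get(vm_type, {})
    # Assumption: Premium_LRS is used when no disk type is set explicitly.
    disk_type = pool.get("osDisk", {}).get("diskType", "Premium_LRS")
    if disk_type == "Premium_LRS" and not caps.get("PremiumIO", True):
        errors.append(
            f'{path}.osDisk.diskType: Invalid value: "{disk_type}": '
            f"PremiumIO not supported for instance type {vm_type}"
        )
    if pool.get("vmNetworkingType") == "Accelerated" and not caps.get(
        "AcceleratedNetworkingEnabled", True
    ):
        errors.append(
            f'{path}.vmNetworkingType: Invalid value: "Accelerated": '
            f"vm networking type is not supported for instance type {vm_type}"
        )
    return errors


def validate_install_config(cfg):
    """Apply the same pool validation to the default, control plane, and compute pools."""
    errors = []
    default_pool = cfg.get("platform", {}).get("azure", {}).get("defaultMachinePlatform")
    if default_pool:
        errors += validate_pool("platform.azure.defaultMachinePlatform", default_pool)
    control_plane = cfg.get("controlPlane", {}).get("platform", {}).get("azure")
    if control_plane:
        errors += validate_pool("controlPlane.platform.azure", control_plane)
    for i, pool in enumerate(cfg.get("compute", [])):
        azure_pool = pool.get("platform", {}).get("azure")
        if azure_pool:
            errors += validate_pool(f"compute[{i}].platform.azure", azure_pool)
    return errors


cfg = {"platform": {"azure": {"defaultMachinePlatform": {"type": "Standard_D11_v2"}}}}
for err in validate_install_config(cfg):
    print("ERROR", err)
```

The sketch ignores inheritance between defaultMachinePlatform and the individual pools; it only shows that the default pool goes through the same checks.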
Please review the following PR: https://github.com/openshift/installer/pull/7818
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Unit test failure rates are high:
https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-oc-master-unit
TestNewAppRunAll/emptyDir_volumes is failing:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_oc/1557/pull-ci-openshift-oc-master-unit/1710206848667226112
Version-Release number of selected component (if applicable):
How reproducible:
Run the unit tests locally or in CI and observe that the unit test job fails.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/4070
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The OpenShift installer supports HTTP proxy configuration in a restricted environment. However, the bootstrap node does not appear to use the configured proxy when it fetches its Ignition assets.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-04-27-113605
How reproducible:
Always
Steps to Reproduce:
1. try IPI installation in a restricted/disconnected network with "publish: Internal", and without using Google Private Access
Actual results:
The installation failed, because bootstrap node failed to fetch its ignition config.
Expected results:
The installation should succeed.
Additional info:
A similar issue was previously fixed on AWS (and Alibaba Cloud) in https://bugzilla.redhat.com/show_bug.cgi?id=2090836.
Description of problem:
Azure cluster installation failed with sdn network plugin
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-09-17-045811 4.13.0-0.nightly-2023-09-18-210322
How reproducible:
Sometimes; 2 of 5 jobs failed in CI
Steps to Reproduce:
1. Install azure cluster with template aos-4_15/ipi-on-azure/versioned-installer-customer_vpc
Actual results:
Installation failed
09-19 10:56:47.536 level=info msg=Cluster operator node-tuning Progressing is True with Reconciling: Working towards "4.15.0-0.nightly-2023-09-17-045811"
09-19 10:56:47.536 level=info msg=Cluster operator openshift-apiserver Progressing is True with APIServerDeployment_PodsUpdating: APIServerDeploymentProgressing: deployment/apiserver.openshift-apiserver: 1/3 pods have been updated to the latest generation
09-19 10:56:47.536 level=info msg=Cluster operator openshift-controller-manager Progressing is True with _DesiredStateNotYetAchieved: Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3
09-19 10:56:47.536 level=info msg=Progressing: deployment/route-controller-manager: updated replicas is 1, desired replicas is 3
09-19 10:56:47.536 level=info msg=Cluster operator storage Progressing is True with AzureDiskCSIDriverOperatorCR_AzureDiskDriverNodeServiceController_Deploying::AzureFileCSIDriverOperatorCR_AzureFileDriverNodeServiceController_Deploying: AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
09-19 10:56:47.536 level=info msg=AzureFileCSIDriverOperatorCRProgressing: AzureFileDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods
09-19 10:56:47.536 level=error msg=Cluster initialization failed because one or more operators are not functioning properly.
09-19 10:56:47.536 level=error msg=The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
09-19 10:56:47.537 level=error msg=https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
09-19 10:56:47.537 level=error msg=The 'wait-for install-complete' subcommand can then be used to continue the installation
09-19 10:56:47.537 level=error msg=failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, kube-apiserver, machine-config are not available
09-19 10:56:47.537 [ERROR] Installation failed with error code '6'. Aborting execution.

oc get nodes
NAME STATUS ROLES AGE VERSION
jima41501-c646k-master-0 NotReady control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-master-1 Ready control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-master-2 Ready control-plane,master 3h35m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus1-x82cb Ready worker 3h22m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus2-jxbbt Ready worker 3h19m v1.28.2+fde2a12
jima41501-c646k-worker-southcentralus3-s4j6c Ready worker 3h18m v1.28.2+fde2a12

huirwang@huirwang-mac workspace % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-09-17-045811 False True True 3h31m WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.7:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
baremetal 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
cloud-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True False False 3h34m
cloud-credential 4.15.0-0.nightly-2023-09-17-045811 True False False 3h39m
cluster-autoscaler 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
config-operator 4.15.0-0.nightly-2023-09-17-045811 True False False 3h31m
console 4.15.0-0.nightly-2023-09-17-045811 False True False 3h20m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.15.0-0.nightly-2023-09-17-045811 False True False 3h24m Missing 1 available replica(s)
csi-snapshot-controller 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
dns 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd 4.15.0-0.nightly-2023-09-17-045811 True True True 3h29m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.15.0-0.nightly-2023-09-17-045811 True True False 3h19m Progressing: The registry is ready...
ingress 4.15.0-0.nightly-2023-09-17-045811 True False False 3h19m
insights 4.15.0-0.nightly-2023-09-17-045811 True False False 3h19m
kube-apiserver 4.15.0-0.nightly-2023-09-17-045811 False True True 3h31m StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 8
kube-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True True True 3h27m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.15.0-0.nightly-2023-09-17-045811 True True True 3h27m NodeControllerDegraded: The master nodes not ready: node "jima41501-c646k-master-0" not ready since 2023-09-19 02:13:06 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
machine-api 4.15.0-0.nightly-2023-09-17-045811 True False False 3h17m
machine-approver 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
machine-config 4.15.0-0.nightly-2023-09-17-045811 False False True 164m Cluster not available for [{operator 4.15.0-0.nightly-2023-09-17-045811}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [context deadline exceeded, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
monitoring 4.15.0-0.nightly-2023-09-17-045811 True False False 3h15m
network 4.15.0-0.nightly-2023-09-17-045811 True True False 3h31m DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
node-tuning 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m Working towards "4.15.0-0.nightly-2023-09-17-045811"
openshift-apiserver 4.15.0-0.nightly-2023-09-17-045811 True True True 3h24m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager 4.15.0-0.nightly-2023-09-17-045811 True True False 3h27m Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples 4.15.0-0.nightly-2023-09-17-045811 True False False 3h23m
operator-lifecycle-manager 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-09-17-045811 True False False 3h30m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-09-17-045811 True False False 3h25m
service-ca 4.15.0-0.nightly-2023-09-17-045811 True False False 3h31m
storage 4.15.0-0.nightly-2023-09-17-045811 True True False 3h30m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...

[systemd]
Failed Units: 1
  openshift-azure-routes.service
[core@jima41501-c646k-master-0 ~]$ sudo -i
[systemd]
Failed Units: 1
  openshift-azure-routes.service
[root@jima41501-c646k-master-0 ~]# systemctl status openshift-azure-routes.service
× openshift-azure-routes.service - Work around Azure load balancer hairpin
     Loaded: loaded (/etc/systemd/system/openshift-azure-routes.service; static)
     Active: failed (Result: exit-code) since Tue 2023-09-19 02:10:31 UTC; 3h 23min ago
   Duration: 55ms
TriggeredBy: ● openshift-azure-routes.path
    Process: 13908 ExecStart=/bin/bash /opt/libexec/openshift-azure-routes.sh start (code=exited, status=1/FAILURE)
   Main PID: 13908 (code=exited, status=1/FAILURE)
        CPU: 77ms

Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: Started Work around Azure load balancer hairpin.
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: processing v4 vip 10.0.0.4
Sep 19 02:10:31 jima41501-c646k-master-0 openshift-azure-routes[13908]: /opt/libexec/openshift-azure-routes.sh: line 130: ovnkContaine>
Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 02:10:31 jima41501-c646k-master-0 systemd[1]: openshift-azure-routes.service: Failed with result 'exit-code'.
4.13 failed in CI:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.13-e2e-azure-sdn/1703878138968150016/artifacts/e2e-azure-sdn/gather-extra/artifacts/oc_cmds/clusteroperators
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2023-09-18-210322 False True True 55m WellKnownAvailable: The well-known endpoint is not yet available: kube-apiserver oauth endpoint https://10.0.0.6:6443/.well-known/oauth-authorization-server is not yet served and authentication operator keeps waiting (check kube-apiserver operator, and check that instances roll out successfully, which can take several minutes per instance)
baremetal 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
cloud-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True False False 56m
cloud-credential 4.13.0-0.nightly-2023-09-18-210322 True False False 58m
cluster-autoscaler 4.13.0-0.nightly-2023-09-18-210322 True False False 53m
config-operator 4.13.0-0.nightly-2023-09-18-210322 True False False 55m
console 4.13.0-0.nightly-2023-09-18-210322 False True False 45m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.13.0-0.nightly-2023-09-18-210322 False True False 47m Missing 1 available replica(s)
csi-snapshot-controller 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
dns 4.13.0-0.nightly-2023-09-18-210322 True True False 53m DNS "default" reports Progressing=True: "Have 5 available node-resolver pods, want 6."
etcd 4.13.0-0.nightly-2023-09-18-210322 True True True 52m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.13.0-0.nightly-2023-09-18-210322 True True False 45m NodeCADaemonProgressing: The daemon set node-ca is deploying node pods...
ingress 4.13.0-0.nightly-2023-09-18-210322 True False False 44m
insights 4.13.0-0.nightly-2023-09-18-210322 True False False 47m
kube-apiserver 4.13.0-0.nightly-2023-09-18-210322 False True True 53m StaticPodsAvailable: 0 nodes are active; 3 nodes are at revision 0; 0 nodes have achieved new revision 10
kube-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True True True 51m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.13.0-0.nightly-2023-09-18-210322 True True True 51m NodeControllerDegraded: The master nodes not ready: node "ci-op-pjxb081y-0c3e0-bxvlr-master-0" not ready since 2023-09-18 21:40:51 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
machine-api 4.13.0-0.nightly-2023-09-18-210322 True False False 46m
machine-approver 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
machine-config 4.13.0-0.nightly-2023-09-18-210322 False False True 31m Cluster not available for [{operator 4.13.0-0.nightly-2023-09-18-210322}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 6, updated: 6, ready: 5, unavailable: 1)]
marketplace 4.13.0-0.nightly-2023-09-18-210322 True False False 53m
monitoring 4.13.0-0.nightly-2023-09-18-210322 True False False 43m
network 4.13.0-0.nightly-2023-09-18-210322 True True False 55m DaemonSet "/openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)...
node-tuning 4.13.0-0.nightly-2023-09-18-210322 True True False 53m Working towards "4.13.0-0.nightly-2023-09-18-210322"
openshift-apiserver 4.13.0-0.nightly-2023-09-18-210322 True True True 44m APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable for apiserver.openshift-apiserver (3 containers are waiting in pending apiserver-66d764fbd6-r2s8d pod)
openshift-controller-manager 4.13.0-0.nightly-2023-09-18-210322 True True False 54m Progressing: deployment/controller-manager: updated replicas is 1, desired replicas is 3...
openshift-samples 4.13.0-0.nightly-2023-09-18-210322 True False False 47m
operator-lifecycle-manager 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-09-18-210322 True False False 54m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-09-18-210322 True False False 48m
service-ca 4.13.0-0.nightly-2023-09-18-210322 True False False 55m
storage 4.13.0-0.nightly-2023-09-18-210322 True True False 54m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
Expected results:
Installation succeeds
Additional info:
We suspect this is caused by PR https://github.com/openshift/machine-config-operator/pull/3878/files
Description of problem:
Users may provide a DNS domain outside GCP. Once custom DNS is enabled, the installer should skip DNS zone validation. Currently it fails with:
level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": failed to get GCP public zone: no matching public DNS Zone found"
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446 4.16.0-0.nightly-2024-02-03-221256
How reproducible:
Always
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
2. Configure a baseDomain that does not exist on GCP.
Actual results:
See description.
Expected results:
Installer should skip the validation, as the custom domain may not exist on GCP
Additional info:
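A minimal sketch of the expected control flow, under stated assumptions: the field name `platform.gcp.userProvisionedDNS` and the error text come from the report above, while `validate_base_domain`, `needs_public_zone_lookup`, and the `known_public_zones` set are hypothetical stand-ins for the installer's real zone lookup.

```python
def needs_public_zone_lookup(install_config: dict) -> bool:
    """Zone lookup is only meaningful when the installer manages DNS in GCP."""
    gcp = install_config.get("platform", {}).get("gcp", {})
    return gcp.get("userProvisionedDNS") != "Enabled"


def validate_base_domain(install_config: dict, known_public_zones: set):
    """Return an error string, or None when validation passes or is skipped."""
    if not needs_public_zone_lookup(install_config):
        # Custom DNS: the base domain may live outside GCP, so skip the check.
        return None
    if install_config.get("baseDomain") not in known_public_zones:
        return "failed to get GCP public zone: no matching public DNS Zone found"
    return None


zones = {"example.gcp.test"}  # hypothetical zone hosted in the GCP project
custom_dns = {
    "baseDomain": "corp.example.org",
    "featureSet": "TechPreviewNoUpgrade",
    "platform": {"gcp": {"userProvisionedDNS": "Enabled"}},
}
managed_dns = {"baseDomain": "corp.example.org", "platform": {"gcp": {}}}

print(validate_base_domain(custom_dns, zones))
print(validate_base_domain(managed_dns, zones))
```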
Description of problem:
Due to the way the termination handler unit tests are configured, the counter of HTTP requests to the mock handler can, in some cases, cause the test to deadlock and time out. This happens randomly because the ordering of the tests affects when the bug occurs.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
It happens randomly when run in CI, or when the full suite is run. If the tests are focused, it happens every time: focusing on "poll URL cannot be reached" reproduces the hang.
Steps to Reproduce:
1. Add `-focus "poll URL cannot be reached"` to the unit test ginkgo arguments.
2. Run `make unit`.
Actual results:
test suite hangs after this output: "Handler Suite when running the handler when polling the termination endpoint and the poll URL cannot be reached should return an error /home/mike/dev/machine-api-provider-aws/pkg/termination/handler_test.go:197"
Expected results:
Tests pass
Additional info:
To fix this we need to isolate the test in its own context block. This patch should do the trick:
diff --git a/pkg/termination/handler_test.go b/pkg/termination/handler_test.go
index 2b98b08b..0f85feae 100644
--- a/pkg/termination/handler_test.go
+++ b/pkg/termination/handler_test.go
@@ -187,7 +187,9 @@ var _ = Describe("Handler Suite", func() {
 				Consistently(nodeMarkedForDeletion(testNode.Name)).Should(BeFalse())
 			})
 		})
+	})
 
+	Context("when the termination endpoint is not valid", func() {
 		Context("and the poll URL cannot be reached", func() {
 			BeforeEach(func() {
 				nonReachable := "abc#1://localhost"
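The general hazard behind this bug, test outcomes that depend on state shared between tests, can be shown with a small self-contained Python sketch. It is not the Go/Ginkgo code from the repository: `MockEndpoint`, `polls_once`, `polls_twice`, and `run` are hypothetical, and the sketch reports a failed check where the real suite hangs.

```python
from itertools import permutations


class MockEndpoint:
    """Stand-in for a mock HTTP handler that counts the requests it receives."""

    def __init__(self):
        self.requests = 0

    def handle(self):
        self.requests += 1


def polls_once(endpoint):
    endpoint.handle()
    return endpoint.requests == 1


def polls_twice(endpoint):
    endpoint.handle()
    endpoint.handle()
    return endpoint.requests == 2


def run(tests, isolate):
    """Run tests in the given order, with a fresh endpoint per test when isolated."""
    shared = MockEndpoint()
    results = []
    for test in tests:
        endpoint = MockEndpoint() if isolate else shared
        results.append(test(endpoint))
    return results


def order_independent(isolate):
    """True when every ordering of the tests passes."""
    orders = permutations([polls_once, polls_twice])
    return all(all(run(order, isolate)) for order in orders)


print("shared counter:", order_independent(isolate=False))
print("isolated counter:", order_independent(isolate=True))
```

Giving each group of tests its own setup, as the patch does with a separate Context block, plays the role of `isolate=True` here.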
Please review the following PR: https://github.com/openshift/openstack-cinder-csi-driver-operator/pull/133
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Without this fix, our e2e tests have to be retested many times over, and a retest still does not guarantee success: when the CSI driver hits this issue, it does not seem to recover easily. There is no impact on functionality, but there are too many of these errors:
snapshot controller failed to update ...xxxxx the object has been modified; please apply your changes to the latest version and try again
Version-Release number of selected component (if applicable):
How reproducible:
Almost every PR in oadp-operator has one or more of these "flake" errors, which prevent our e2e from succeeding and force a retest.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please accept the cherry-picks in the following PRs to help us deal with the many flakes in our #forum-oadp e2e caused by CSI drivers failing to remove snapshot annotations:
https://github.com/openshift/csi-external-snapshotter/pull/140
https://github.com/openshift/csi-external-snapshotter/pull/141
https://github.com/openshift/csi-external-snapshotter/pull/142
https://github.com/openshift/csi-external-snapshotter/pull/143
clones https://github.com/kubernetes-csi/external-snapshotter/issues/748
Please backport this to all OCP versions that OpenShift API for Data Protection is supported and tested on, currently 4.12+
slack: https://redhat-internal.slack.com/archives/CBQHQFU0N/p1707342685875549
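The "object has been modified" error quoted in this issue is an optimistic-concurrency conflict: the update carried a stale resource version. A common way to handle it is to re-read the object and retry. The sketch below illustrates that pattern with an in-memory store; it is not the external-snapshotter code, and `Store`, `RacyStore`, `ConflictError`, and `remove_annotation` are hypothetical names.

```python
import copy


class ConflictError(Exception):
    """Raised when an update is based on a stale resourceVersion."""


class Store:
    """In-memory object store with optimistic concurrency, loosely modeled on an API server."""

    def __init__(self, obj):
        self._obj = copy.deepcopy(obj)
        self._obj["resourceVersion"] = 1

    def get(self):
        return copy.deepcopy(self._obj)

    def update(self, obj):
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise ConflictError(
                "the object has been modified; please apply your changes "
                "to the latest version and try again"
            )
        self._obj = copy.deepcopy(obj)
        self._obj["resourceVersion"] += 1


class RacyStore(Store):
    """Simulates another writer bumping the version after the first `races` reads."""

    def __init__(self, obj, races):
        super().__init__(obj)
        self.races = races

    def get(self):
        snapshot = super().get()
        if self.races > 0:
            self.races -= 1
            self._obj["resourceVersion"] += 1
        return snapshot


def remove_annotation(store, key, attempts=5):
    """Re-read and retry on conflict instead of failing on the first stale write."""
    for _ in range(attempts):
        obj = store.get()
        obj["annotations"].pop(key, None)
        try:
            store.update(obj)
            return True
        except ConflictError:
            continue
    return False


store = RacyStore({"annotations": {"example/in-use": "yes", "keep": "x"}}, races=2)
print(remove_annotation(store, "example/in-use"), store.get()["annotations"])
```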
UDP Packets are subject to SNAT in a self-managed OCP 4.13.13 cluster on Azure (OVN-K as CNI) using a Load Balancer Service with `externalTrafficPolicy: Local`. UDP Packets correctly arrive to the Node hosting the Pod but the source IP seen by the Pod is the OVN GW Router of the Node.
I've reproduced the customer scenario with the following steps:
This issue is critical because it is blocking the customer's business.
Description of problem:
The DomainMapping CRD still uses API version v1alpha1, but v1alpha1 will be removed in Serverless Operator version 1.33. Upgrade the API version to v1beta1, which has been available since Serverless Operator 1.21.
Additional info:
NOTE: This should be backported to 4.11. Also check the minimum Serverless Operator version supported in 4.11.
Slack thread: https://redhat-internal.slack.com/archives/CJYKV1YAH/p1693809331579619
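As a rough illustration of the change described above, the version bump on a DomainMapping manifest amounts to swapping the apiVersion string. The helper name `upgrade_domain_mapping` and the sample manifest are hypothetical; the API group `serving.knative.dev` is the Knative Serving group that DomainMapping belongs to.

```python
OLD_API_VERSION = "serving.knative.dev/v1alpha1"
NEW_API_VERSION = "serving.knative.dev/v1beta1"


def upgrade_domain_mapping(manifest: dict) -> dict:
    """Return a copy of the manifest with the DomainMapping apiVersion bumped to v1beta1."""
    if manifest.get("kind") == "DomainMapping" and manifest.get("apiVersion") == OLD_API_VERSION:
        return {**manifest, "apiVersion": NEW_API_VERSION}
    # Other kinds and already-upgraded manifests are returned unchanged.
    return manifest


# Hypothetical manifest used only for illustration.
mapping = {
    "apiVersion": "serving.knative.dev/v1alpha1",
    "kind": "DomainMapping",
    "metadata": {"name": "app.example.test", "namespace": "demo"},
    "spec": {"ref": {"name": "hello", "kind": "Service", "apiVersion": "serving.knative.dev/v1"}},
}

print(upgrade_domain_mapping(mapping)["apiVersion"])
```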
Please review the following PR: https://github.com/openshift/cloud-network-config-controller/pull/122
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Configure the VM type as Standard_NP10s, which only supports Generation V1, in install-config:
--------------
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_NP10s
  replicas: 3

Continue the installation; the installer fails when provisioning the bootstrap node:
--------------
ERROR
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm"
ERROR
ERROR with azurerm_linux_virtual_machine.bootstrap,
ERROR on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR 193: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failure applying terraform for "bootstrap" stage: error applying Terraform configs: failed to apply Terraform: exit status 1
ERROR
ERROR Error: creating Linux Virtual Machine: (Name "jima1211test-rqfhm-bootstrap" / Resource Group "jima1211test-rqfhm-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The selected VM size 'Standard_NP10s' cannot boot Hypervisor Generation '2'. If this was a Create operation please check that the Hypervisor Generation of the Image matches the Hypervisor Generation of the selected VM Size. If this was an Update operation please select a Hypervisor Generation '2' VM Size. For more information, see https://aka.ms/azuregen2vm"
ERROR
ERROR with azurerm_linux_virtual_machine.bootstrap,
ERROR on main.tf line 193, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR 193: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR

The issue seems to have been introduced by https://github.com/openshift/installer/pull/7642/
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-09-012410
How reproducible:
Always
Steps to Reproduce:
1. Configure the VM type as Standard_NP10s for the control plane in install-config.yaml.
2. Install the cluster.
Actual results:
installer failed when provisioning bootstrap node
Expected results:
The installation succeeds.
Additional info:
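One way to avoid this class of failure is to derive the image generation from what the selected VM size supports rather than assuming Generation 2. The sketch below is an assumption-laden illustration, not the installer's logic: `pick_generation` and the `SIZE_GENERATIONS` table are hypothetical, the Standard_NP10s entry follows the report above, and the Standard_D8s_v3 entry is included only as an example of a size assumed to support both generations.

```python
from __future__ import annotations

# Hypothetical lookup table; a real implementation would read the
# supported generations from the Azure SKU API.
SIZE_GENERATIONS = {
    "Standard_NP10s": {"V1"},          # per the report: Generation V1 only
    "Standard_D8s_v3": {"V1", "V2"},   # assumed, for illustration
}


def pick_generation(supported: set[str], preferred: str = "V2") -> str:
    """Prefer `preferred` when the size supports it, else fall back to a supported generation."""
    if not supported:
        raise ValueError("VM size reports no supported Hyper-V generation")
    if preferred in supported:
        return preferred
    return sorted(supported)[0]


for size, generations in SIZE_GENERATIONS.items():
    print(size, "->", pick_generation(generations))
```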
The sno_arm.txt integration test fails because it tries to extract arm64 pxe bits from the OKD release payload that is x86_64.
AC:
skip or remove the sno_arm.txt test.
The IP range 168.254.0.0/16 that we chose as default for the transit switch is a public one. Let's use a private one instead, making sure it won't collide with address blocks already in use.
In the future we might want to make this configurable, but for now let's just make sure we pick an IP range that is not used elsewhere in openshift.
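The two requirements stated above (private, and not colliding with ranges already in use) can be checked with Python's standard `ipaddress` module. In this sketch the helper `is_usable` and the `IN_USE` list are hypothetical; the listed ranges are assumed examples of blocks a cluster might already use, not an authoritative inventory.

```python
import ipaddress

# Assumed examples of address blocks already in use in a cluster.
IN_USE = [
    ipaddress.ip_network("10.128.0.0/14"),
    ipaddress.ip_network("172.30.0.0/16"),
    ipaddress.ip_network("100.64.0.0/16"),
]


def is_usable(candidate: str, in_use=IN_USE) -> bool:
    """A candidate range is usable if it is private and overlaps none of the in-use ranges."""
    network = ipaddress.ip_network(candidate)
    if not network.is_private:
        return False
    return not any(network.overlaps(other) for other in in_use)


# 168.254.0.0/16 is publicly routable, so it is rejected.
print(is_usable("168.254.0.0/16"))
# Overlaps the first in-use block, so it is rejected as well.
print(is_usable("10.128.0.0/16"))
# A private block that does not collide with the assumed in-use ranges.
print(is_usable("192.168.100.0/24"))
```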
Description of problem:
RPS configuration test failed with the following error:
[FAILED] Failure recorded during attempt 1:
a host device rps mask is different from the reserved CPUs; have "0" want ""
Expected
    <bool>: false
to be true
In [It] at: /tmp/cnf-ZdGbI/cnf-features-deploy/vendor/github.com/onsi/gomega/internal/assertion.go:62 @ 09/06/23 03:47:44.144
< Exit [It] [test_id:55012] Should have the correct RPS configuration - /tmp/cnf-ZdGbI/cnf-features-deploy/vendor/github.com/openshift/cluster-node-tuning-operator/test/e2e/performanceprofile/functests/1_performance/performance.go:337 @ 09/06/23 03:47:44.144 (39.949s)
Full report:
How reproducible:
Very often
Steps to Reproduce:
1. Reproduced automatically by the cnf-tests nightly job
Actual results:
Some of the virtual devices are not configured with the correct RPS mask
Expected results:
All virtual network devices are expected to have the correct RPS mask
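For readers unfamiliar with RPS masks, the comparison the test performs can be sketched as follows: the reserved CPU set is turned into a hexadecimal bitmask and compared with the value read from the device. This is an illustrative Python sketch, not the Go test code; `cpus_to_mask`, `normalize_mask`, and `rps_matches` are hypothetical helpers, and the comma-separated 32-bit word format is the one the kernel uses when printing CPU masks in sysfs.

```python
def cpus_to_mask(cpus) -> str:
    """Build a hexadecimal bitmask with one bit set per CPU id."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def normalize_mask(raw: str) -> str:
    """Strip the comma separators and leading zeros used when the kernel prints a mask."""
    digits = raw.strip().replace(",", "").lstrip("0")
    return digits or "0"


def rps_matches(device_value: str, reserved_cpus) -> bool:
    """Compare a device's RPS mask with the mask derived from the reserved CPUs."""
    return normalize_mask(device_value) == cpus_to_mask(reserved_cpus)


# CPUs 0 and 1 reserved: bits 0 and 1 set.
print(cpus_to_mask([0, 1]))
# A mask spanning two 32-bit words, as printed with a comma separator.
print(rps_matches("00000002,00000001", [0, 33]))
```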
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/82
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1190
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When performancePlus is enabled in a StorageClass for azure-disk-csi-driver, provisioning requires a volume size larger than 512GB, but the error message says "The performancePlus flag can only be set on disks at least 512 GB in size", which implies that 512GB is supported. This is confusing to users.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a StorageClass with parameters: enablePerformancePlus: "true"
2. Create a PVC with 512Gi
3. Get the ProvisioningFailed message below, which is a bit confusing:
Warning ProvisioningFailed <invalid> (x5 over <invalid>) disk.csi.azure.com_wduan0810manual-b5dng-master-1_d7a29bbf-3f49-4207-af33-056e0814f6e2 failed to provision volume with StorageClass "managed-csi-test-28-sssdlrs-enableperformanceplus": rpc error: code = Internal desc = Retriable: false, RetryAfter: 0s, HTTPStatusCode: 400, RawError: { "error": { "code": "BadRequest", "message": "The performancePlus flag can only be set on disks at least 512 GB in size." } }
Actual results:
Expected results:
The message should say "larger than 512GB" rather than "at least".
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/7494
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The ConfigObserver controller waits until all of the given informers are marked as synced, including the build informer. However, when the build capability is disabled, the build informer never syncs, so ConfigObserver blocks and never runs. This likely only happens on 4.15 because the capability watching mechanism was bound to ConfigObserver in 4.15.
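The blocking behavior, and one way to avoid it, can be sketched as follows. This is a simplified model rather than the operator's code: the Informer class, the capability names, and the helper functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class Informer:
    """Hypothetical stand-in for an informer with a sync status."""
    name: str
    has_synced: Callable[[], bool]
    # Capability that must be enabled for this informer to ever sync;
    # None means the informer is always required.
    capability: Optional[str] = None


def required_informers(informers: Iterable[Informer], enabled: Set[str]) -> List[Informer]:
    """Keep only informers that need no capability or whose capability is enabled."""
    return [i for i in informers if i.capability is None or i.capability in enabled]


def ready_to_run(informers: Iterable[Informer], enabled: Set[str]) -> bool:
    """True once every informer that is able to sync has synced."""
    return all(i.has_synced() for i in required_informers(informers, enabled))


INFORMERS = [
    Informer("config", lambda: True),
    # Models the build informer, which never syncs while the capability is disabled.
    Informer("build", lambda: False, capability="Build"),
]
```

Waiting on every informer unconditionally would never finish here; filtering by enabled capability lets the controller start when the build capability is disabled.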
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Launch cluster-bot cluster via "launch 4.15.0-0.nightly-2023-11-05-192858,openshift/cluster-openshift-controller-manager-operator#315 no-capabilities"
Steps to Reproduce:
1. 2. 3.
Actual results:
ConfigObserver controller stuck in failure
Expected results:
ConfigObserver controller runs and successfully clears all deployer service accounts when the deploymentconfig capability is disabled.
Additional info:
As an openshift developer, I want to remove the image openshift-proxy-pull-test-container from the build, so we will not be affected by the possible bugs during the image build.
We asked the ART team to add this image in the ticket https://issues.redhat.com/browse/ART-2961
Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/25
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/201
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-catalogd/pull/27
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade and the installer applies the incorrect config to the CPMS resulting in masters being recreated.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. Create install-config.yaml with static IPs following the documentation
2. Run `openshift-install create cluster`
3. As the install progresses, watch the machine definitions
Actual results:
new master machines are created
Expected results:
all machines are the same as what was created by the installer.
Additional info:
Description of problem:
Some 3rd party clouds do not require the use of an external CCM. The installer enables an external CCM by default whenever the platform is external.
Version-Release number of selected component (if applicable):
4.14 nightly
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The external CCM can not be disabled when the platform type is external.
Expected results:
The external CCM should be able to be disabled when the platform type is external.
Additional info:
Description of problem:
The flaky-e2e-test suite has been failing consistently due to some changes made to how the test environments are set up in each test. Two tests in particular have been failing and need to be fixed:
[FLAKE] should clear up the condition in the InstallPlan status that contains an error message when a valid OperatorGroup is created
[FLAKE] consistent generation
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Run flaky-e2e-test suite
Actual results:
Tests never pass
Expected results:
Tests pass at least a majority of the time
Additional info:
Description of problem:
level=error
level=error msg=Error: waiting for EC2 Instance (i-054a010f3e99f7a2c) create: timeout while waiting for state to become 'running' (last state: 'pending', timeout: 10m0s)
level=error
level=error msg= with module.masters.aws_instance.master[2],
level=error msg= on master/main.tf line 136, in resource "aws_instance" "master":
level=error msg= 136: resource "aws_instance" "master" {
level=error
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1936dcc]
goroutine 1 [running]:
github.com/openshift/installer/pkg/asset.PersistToFile({0x22860140?, 0x277372f0?}, {0x7ffc102e22db, 0xe})
 /go/src/github.com/openshift/installer/pkg/asset/asset.go:57 +0xac
github.com/openshift/installer/pkg/asset.(*fileWriterAdapter).PersistToFile(0x227fa3e0?, {0x7ffc102e22db?, 0x277372f0?})
 /go/src/github.com/openshift/installer/pkg/asset/filewriter.go:19 +0x31
main.runTargetCmd.func1({0x7ffc102e22db, 0xe})
 /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:277 +0x24a
main.runTargetCmd.func2(0x275d0340?, {0xc0007a6d00?, 0x1?, 0x1?})
 /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:302 +0xe7
github.com/spf13/cobra.(*Command).execute(0x275d0340, {0xc0007a6cc0, 0x1, 0x1})
 /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000956000)
 /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1040 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
 /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:968
main.installerMain()
 /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:56 +0x2b0
main.main()
 /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:33 +0xff
Installer exit with code 2
Version-Release number of selected component (if applicable):
4.15
How reproducible:
I noticed it on a presubmit
Steps to Reproduce:
1. Run the pull-ci-openshift-origin-master-e2e-aws-ovn-fips job on an openshift/origin repo presubmit
Actual results:
Expected results:
Additional info:
Example where it occurred: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/28372/pull-ci-openshift-origin-master-e2e-aws-ovn-fips/1719449092209250304
This search shows it happened on several jobs: https://search.ci.openshift.org/?search=asset.PersistToFile&maxAge=48h&context=1&type=build-log&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
This is a clone of issue OCPBUGS-24537. The following is the description of the original issue:
—
Description of problem:
4.15 nightly payloads have been affected by this test multiple times:
: [sig-arch] events should not repeat pathologically for ns/openshift-kube-scheduler
{ 1 events happened too frequently event happened 21 times, something is wrong: namespace/openshift-kube-scheduler node/ci-op-2gywzc86-aa265-5skmk-master-1 pod/openshift-kube-scheduler-guard-ci-op-2gywzc86-aa265-5skmk-master-1 hmsg/2652c73da5 - reason/ProbeError Readiness probe error: Get "https://10.0.0.7:10259/healthz": dial tcp 10.0.0.7:10259: connect: connection refused result=reject body: From: 08:41:08Z To: 08:41:09Z}
Of the 10 jobs in each aggregation, 2 to 3 failed this test. Historically this test passed 100% of the time, but with the past two days of test data the pass rate has dropped to 97%, and the aggregator started allowing this failure in the latest payload: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-azure-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1732295947339173888
The first payload in which this appeared is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-12-05-071627.
All the events happened while cluster-operator/kube-scheduler was progressing.
For comparison, here is a passed job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936539870498816
Here is a failed one: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade/1731936538192777216
They both have the same set of probe error events. For the passing jobs, the frequency is lower than 20, while for the failed job one of those events repeated more than 20 times, which results in the test failure.
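The pass/fail boundary described above can be illustrated with a small counting sketch. It assumes each event is reduced to a string key and that the limit is 20 repetitions, as the description suggests; the function name and sample keys are hypothetical, and the real test in openshift/origin is more involved.

```python
from collections import Counter
from typing import Dict, Iterable


def pathological_events(event_keys: Iterable[str], threshold: int = 20) -> Dict[str, int]:
    """Return the event keys that repeat more than threshold times, with their counts."""
    counts = Counter(event_keys)
    return {key: n for key, n in counts.items() if n > threshold}


# Sample keys for illustration only.
probe_error = "ns/openshift-kube-scheduler reason/ProbeError"
other_event = "ns/openshift-kube-scheduler reason/Pulled"

passing_run = [probe_error] * 19 + [other_event] * 3
failing_run = [probe_error] * 21 + [other_event] * 3
```

With these samples, passing_run yields no flagged events while failing_run flags the probe error with a count of 21, which matches the "event happened 21 times" message in the failure.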
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-29209. The following is the description of the original issue:
—
Description of problem:
The HyperShift operator applies control-plane-pki-operator RBAC resources regardless of whether PKI reconciliation is disabled for the HostedCluster.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create a 4.15 HostedCluster with PKI reconciliation disabled
2. Unused RBAC resources for control-plane-pki-operator are created
Actual results:
Unused RBAC resources for control-plane-pki-operator are created
Expected results:
RBAC resources for control-plane-pki-operator should not be created if deployment for control-plane-pki-operator itself is not created.
Additional info:
Description of problem:
While attempting to provision 300 clusters every hour of mixed cluster sizes (SNO, compact, and standard), it appears that the metal3 baremetal operator has hit a failure to provision any clusters. Out of the 1850 attempted clusters, only 282 successfully provisioned (mostly SNO size). There are many errors in the baremetal operator log, some of which are actual stack traces, but it is unclear whether they are the actual reason why the clusters began to fail to install, with 100% not installing on the 3rd wave and beyond.
Version-Release number of selected component (if applicable):
Hub OCP - 4.14.0-rc.2
Deployed Cluster OCP - 4.14.0-rc.2
ACM - 2.9.0-DOWNSTREAM-2023-09-27-22-12-46
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Some of the errors found in the logs: {"level":"error","ts":"2023-09-28T22:39:56Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":{"name":"vm01343","namespace":"compact-00046"},"namespace":"compact-00046","name":"vm01343","reconcileID":"4bbfa52f-12a6-4983-b86b-01086491de9f","error":"action \"provisioning\" failed: failed to provision: failed to change provisioning state to \"active\": Internal Server Error","errorVerbose":"Internal Server Error\nfailed to change provisioning state to \"active\"\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).tryChangeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:740\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).changeNodeProvisionState\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:750\ngithub.com/metal3-io/baremetal-operator/pkg/provisioner/ironic.(*ironicProvisioner).Provision\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/provisioner/ironic/ironic.go:1604\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1179\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs
.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\nfailed to provision\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:1188\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handleProvisioning\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:527\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/contr
oller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"provisioning\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io
/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"} {"level":"info","ts":"2023-09-29T16:11:24Z","logger":"provisioner.ironic","msg":"error caught while checking endpoint","host":"standard-00241~vm03618","endpoint":"https://metal3-state.openshift-machine-api.svc.cluster.local:6388/v1/","error":"Bad Gateway"}
Description of problem:
Set custom security group IDs in the installconfig.platform.aws.defaultMachinePlatform.additionalSecurityGroupIDs field of install-config.yaml, such as:
apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: gpei-test1013
platform:
  aws:
    region: us-east-2
    subnets:
    - subnet-0bc86b64e7736479c
    - subnet-0addd33c410b52251
    - subnet-093392f94a4099566
    - subnet-0b915a53042b6dc61
    defaultMachinePlatform:
      additionalSecurityGroupIDs:
      - sg-0fbc4c9733e6c18e7
      - sg-0b46b502b575d30ba
      - sg-02a59f8662d10c6d3
After installation, check the security groups attached to the masters and workers: the masters do not have the specified custom security groups attached, while the workers do.
For one of the masters:
[root@preserve-gpei-worker k_files]# aws ec2 describe-instances --instance-ids i-08c0b0b6e4308be3b --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json
[ [ [ { "GroupName": "terraform-20231013000602175000000002", "GroupId": "sg-04b104d07075afe96" } ] ] ]
For one of the workers:
[root@preserve-gpei-worker k_files]# aws ec2 describe-instances --instance-ids i-00643f07748ec75da --query 'Reservations[*].Instances[*].SecurityGroups[*]' --output json
[ [ [ { "GroupName": "test-sg2", "GroupId": "sg-0b46b502b575d30ba" }, { "GroupName": "terraform-20231013000602174300000001", "GroupId": "sg-0d7cd50d4cb42e513" }, { "GroupName": "test-sg3", "GroupId": "sg-02a59f8662d10c6d3" }, { "GroupName": "test-sg1", "GroupId": "sg-0fbc4c9733e6c18e7" } ] ] ]
The master's controlplanemachineset was also checked. It does have the custom security groups configured, but they are not attached to the master instances in the end.
[root@preserve-gpei-worker k_files]# oc get controlplanemachineset -n openshift-machine-api cluster -o yaml |yq .spec.template.machines_v1beta1_machine_openshift_io.spec.providerSpec.value.securityGroups
- filters:
  - name: tag:Name
    values:
    - gpei-test1013-8lwtb-master-sg
- id: sg-02a59f8662d10c6d3
- id: sg-0b46b502b575d30ba
- id: sg-0fbc4c9733e6c18e7
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-12-104602
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
It works well when setting the security groups in installconfig.controlPlane.platform.aws.additionalSecurityGroupIDs
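The expected fallback can be sketched as below: a machine pool with no security groups of its own should inherit those from defaultMachinePlatform. This is a hypothetical helper operating on plain dictionaries, and the precedence rule (pool-level values win when set) is an assumption, not a copy of the installer's logic.

```python
from typing import Dict, List, Optional

KEY = "additionalSecurityGroupIDs"


def effective_security_groups(default_platform: Optional[Dict], pool_platform: Optional[Dict]) -> List[str]:
    """Return pool-level security group IDs if set, otherwise the defaults."""
    pool_ids = (pool_platform or {}).get(KEY) or []
    if pool_ids:
        # Assumption: an explicit pool-level list overrides the defaults.
        return list(pool_ids)
    return list((default_platform or {}).get(KEY) or [])


# Values taken from the install-config in this report.
default_machine_platform = {
    KEY: ["sg-0fbc4c9733e6c18e7", "sg-0b46b502b575d30ba", "sg-02a59f8662d10c6d3"]
}
control_plane_platform: Dict = {}  # corresponds to "platform: {}" under controlPlane
```

Under this rule the control plane pool, whose platform is empty, resolves to the three default security group IDs, which is the outcome the report expected for the master instances.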
Description of problem:
If you allow the installer to provision a Power VS Workspace instead of bringing your own, it can sometimes fail when creating a network. This is because Power Edge Router can sometimes take up to a minute to configure.
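One common way to tolerate this kind of delay is to poll for readiness before creating the network. The sketch below is generic and makes no claim about how the installer handles it: wait_until_ready and its parameters are hypothetical, and the check callable stands in for whatever readiness query the Power VS API offers.

```python
import time
from typing import Callable


def wait_until_ready(
    check: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll check() until it returns True or timeout seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)  # back off before asking again
```

The 60-second default mirrors the "up to a minute" figure in the description. The sleep and clock parameters are injectable so that the loop can be exercised in tests without real waiting.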
Version-Release number of selected component (if applicable):
How reproducible:
Infrequent, but will probably hit it within 50-100 runs
Steps to Reproduce:
1. Install on Power VS with IPI with serviceInstanceGUID not set in the install-config.yaml
2. Occasionally you'll observe a failure due to the workspace not being ready for networks
Actual results:
Failure
Expected results:
Success
Additional info:
Not consistently reproducible
Please review the following PR: https://github.com/openshift/alibaba-disk-csi-driver-operator/pull/61
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The CannotRetrieveUpdates alert currently provides a link to the web-console so the responding admin can find the RetrievedUpdates=False message. But some admins lack convenient console access (e.g. they're SSHing in to a restricted network, or the cluster does not have the Console capability enabled). Those admins would benefit from oc ... command-line advice.
The alert is new in 4.6:
$ for Y in $(seq 5 12); do git --no-pager grep CannotRetrieveUpdates "origin/release-4.${Y}"; done | head -n1
origin/release-4.6:docs/user/status.md:When CVO is unable to retrieve recommended updates the CannotRetrieveUpdates alert will fire containing the reason. This alert will not fire when the reason updates cannot be retrieved is NoChannel.
and has never provided command-line advice.
Consistently.
1. Install a cluster.
2. Set an impossible channel, such as oc adm upgrade channel testing.
3. Wait an hour.
4. Check firing alerts in /monitoring/alerts.
5. Click through to CannotRetrieveUpdates.
Failure to retrieve updates means that cluster administrators...
description does not provide oc ... advice.
Failure to retrieve updates means that cluster administrators...
description does provide oc ... advice.
This is a manual "clone" of issue OCPBUGS-27397. The following is the description of the original issue:
Description of problem:
After the update to OpenShift Container Platform 4.13, it was reported that the SRV query for _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net is failing. The query sent to CoreDNS does not match any configured forwardPlugin, and therefore the default is applied. When reverting the dns-default pod image back to OpenShift Container Platform 4.12, it works, and this is also the workaround that has been put in place because production applications were affected. Testing shows that the problem is present in OpenShift Container Platform 4.13, 4.14 and even 4.15. Forcing TCP at the pod level does not change the behavior, and the query still fails. But when configuring a specific forwardPlugin for the domain and enforcing DNS over TCP, it works again.
- Adjusting bufsize did not help; the result was still the same (suspected because of https://issues.redhat.com/browse/OCPBUGS-21901, but again, no effect)
- The only way to make it work is to force_tcp, either in the default ". /etc/resolv.conf" section or by configuring a forwardPlugin and forcing TCP
Checking upstream, I found https://github.com/coredns/coredns/issues/5953 and https://github.com/coredns/coredns/pull/6277, which I suspect are related. When building CoreDNS from the master branch, it indeed starts to work again and resolving the SRV entry is possible again.
--- $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.13.27 True False 24h Cluster version is 4.13.27 $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-626td 2/2 Running 0 3m15s 10.128.2.49 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> dns-default-74nnw 2/2 Running 0 87s 10.131.0.47 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-8mggz 2/2 Running 0 2m31s 10.128.1.121 aro-cluster-h78zv-h94mh-master-0 <none> <none> dns-default-clgkg 2/2 Running 0 109s 10.129.2.187 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> dns-default-htdw2 2/2 Running 0 2m10s 10.129.0.43 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-wprln 2/2 Running 0 2m52s 10.130.1.70 aro-cluster-h78zv-h94mh-master-1 <none> <none> node-resolver-4dmgj 1/1 Running 0 17h 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-5c6tj 1/1 Running 0 17h 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-chfr6 1/1 Running 0 17h 10.0.0.7 aro-cluster-h78zv-h94mh-master-2 <none> <none> node-resolver-mnhsp 1/1 Running 0 17h 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-snxsb 1/1 Running 0 17h 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> node-resolver-sp7h8 1/1 Running 0 17h 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc get pod -o wide -n project-100 NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-54f4d6844b-lr6z9 1/1 Running 0 17h 10.131.0.40 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc get dns.operator default -o yaml apiVersion: operator.openshift.io/v1 kind: DNS metadata: creationTimestamp: "2024-01-11T09:14:03Z" finalizers: - dns.operator.openshift.io/dns-controller generation: 4 name: default resourceVersion: "4216641" uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427 spec: logLevel: Normal nodePlacement: {} 
operatorLogLevel: Normal servers: - forwardPlugin: policy: Random protocolStrategy: "" upstreams: - 10.0.0.9 name: example zones: - example.xyz upstreamResolvers: policy: Sequential transportConfig: {} upstreams: - port: 53 type: SystemResolvConf status: clusterDomain: cluster.local clusterIP: 172.30.0.10 conditions: - lastTransitionTime: "2024-01-19T07:54:18Z" message: Enough DNS pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-01-19T07:55:02Z" message: All DNS and node-resolver pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2024-01-18T13:29:59Z" message: The DNS daemonset has available pods, and the DNS service has a cluster IP address. reason: AsExpected status: "True" type: Available - lastTransitionTime: "2024-01-11T09:14:04Z" message: DNS Operator can be upgraded reason: AsExpected status: "True" type: Upgradeable $ oc rsh -n project-100 tools-54f4d6844b-lr6z9 sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL) $ oc logs dns-default-74nnw Defaulted container "dns" out of: dns, kube-rbac-proxy .:5353 hostname.bind.:5353 example.xyz.:5353 [INFO] plugin/reload: Running configuration SHA512 = 88c7c194d29d0a23b322aeee1eaa654ef385e6bd1affae3715028aba1d33cc8340e33184ba183f87e6c66a2014261c3e02edaea8e42ad01ec6a7c5edb34dfc6a CoreDNS-1.10.1 linux/amd64, go1.19.13 X:strictfipsruntime, [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size [INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. 
udp 76 false 512" - - 0 5.003223099s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size --- https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.12.47/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c0de49c0e76f2ee23a107fc9397f2fd32e7a6a8a458906afd6df04ff5bb0f7b $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-8vrwd 2/2 Running 0 6m22s 10.129.0.45 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-fm59d 2/2 Running 0 7m4s 10.129.2.190 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> dns-default-grtqs 2/2 Running 0 7m48s 10.130.1.73 aro-cluster-h78zv-h94mh-master-1 <none> <none> dns-default-l8mp2 2/2 Running 0 6m43s 10.131.0.49 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-slc4n 2/2 Running 0 8m11s 10.128.1.126 aro-cluster-h78zv-h94mh-master-0 <none> <none> dns-default-xgr7c 2/2 Running 0 7m25s 10.128.2.51 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-2nmpx 1/1 Running 0 10m 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-689j7 1/1 Running 0 10m 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> node-resolver-8qhls 1/1 Running 0 10m 10.0.0.7 aro-cluster-h78zv-h94mh-master-2 <none> <none> node-resolver-nv8mq 1/1 Running 0 10m 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-r52v7 1/1 Running 0 10m 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-z8d4n 1/1 Running 0 10m 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> $ oc get pod -n project-100 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-54f4d6844b-lr6z9 1/1 Running 0 18h 10.131.0.40 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc rsh -n project-100 tools-54f4d6844b-lr6z9 sh-4.4$ host -t srv 
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net. --- https://mirror.openshift.com/pub/openshift-v4/x86_64/clients/ocp/4.15.0-rc.2/release.txt - using quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9e8ffba7854f3f02e8940ddcb2636ceb4773db77872ff639a447c4bab3a69ecc $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-gcs7s 2/2 Running 0 5m 10.128.2.52 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> dns-default-mnbh4 2/2 Running 0 4m37s 10.129.0.46 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-p2s6v 2/2 Running 0 3m55s 10.130.1.77 aro-cluster-h78zv-h94mh-master-1 <none> <none> dns-default-svccn 2/2 Running 0 3m13s 10.128.1.128 aro-cluster-h78zv-h94mh-master-0 <none> <none> dns-default-tgktg 2/2 Running 0 3m34s 10.131.0.50 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-xd5vq 2/2 Running 0 4m16s 10.129.2.191 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-2nmpx 1/1 Running 0 18m 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-689j7 1/1 Running 0 18m 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> node-resolver-8qhls 1/1 Running 0 18m 10.0.0.7 aro-cluster-h78zv-h94mh-master-2 <none> <none> 
node-resolver-nv8mq 1/1 Running 0 18m 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-r52v7 1/1 Running 0 18m 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-z8d4n 1/1 Running 0 18m 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> $ oc get pod -n project-100 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-54f4d6844b-lr6z9 1/1 Running 0 18h 10.131.0.40 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc rsh -n project-100 tools-54f4d6844b-lr6z9 sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL) $ oc logs dns-default-tgktg Defaulted container "dns" out of: dns, kube-rbac-proxy .:5353 hostname.bind.:5353 example.net.:5353 [INFO] plugin/reload: Running configuration SHA512 = 8efa6675505d17551d17ca1e2ca45506a731dbab1f53dd687d37cb98dbaf4987a90622b6b030fe1643ba2cd17198a813ba9302b84ad729de4848f8998e768605 CoreDNS-1.11.1 linux/amd64, go1.20.10 X:strictfipsruntime, [INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003577431s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size [INFO] 10.131.0.40:35246 - 61734 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000969251s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. 
SRV: dns: overflowing header size --- quay.io/rhn_support_sreber/coredns:latest - based on https://github.com/coredns/coredns master branch build on January 19th 2024 (suspecting https://github.com/coredns/coredns/pull/6277 to be the fix) $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-bpjpn 2/2 Running 0 2m22s 10.130.1.78 aro-cluster-h78zv-h94mh-master-1 <none> <none> dns-default-c7wcz 2/2 Running 0 99s 10.131.0.51 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-d7qjz 2/2 Running 0 3m6s 10.129.2.193 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> dns-default-dkvtp 2/2 Running 0 78s 10.128.1.131 aro-cluster-h78zv-h94mh-master-0 <none> <none> dns-default-t6sv7 2/2 Running 0 2m44s 10.129.0.47 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-vf9f6 2/2 Running 0 2m 10.128.2.53 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-2nmpx 1/1 Running 0 24m 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-689j7 1/1 Running 0 24m 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> node-resolver-8qhls 1/1 Running 0 24m 10.0.0.7 aro-cluster-h78zv-h94mh-master-2 <none> <none> node-resolver-nv8mq 1/1 Running 0 24m 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-r52v7 1/1 Running 0 24m 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-z8d4n 1/1 Running 0 24m 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> $ oc get pod -n project-100 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-54f4d6844b-lr6z9 1/1 Running 0 18h 10.131.0.40 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc rsh -n project-100 tools-54f4d6844b-lr6z9 sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 
x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net. --- Back wth OpenShift Container Platform 4.13.27 but adjusting `CoreDNS` configuration. Defining specific forwardPlugin and enforcing TCP $ oc get dns.operator default -o yaml apiVersion: operator.openshift.io/v1 kind: DNS metadata: creationTimestamp: "2024-01-11T09:14:03Z" finalizers: - dns.operator.openshift.io/dns-controller generation: 7 name: default resourceVersion: "4230436" uid: c8f5c627-2010-4c4a-a5fe-ed87f320e427 spec: logLevel: Normal nodePlacement: {} operatorLogLevel: Normal servers: - forwardPlugin: policy: Random protocolStrategy: TCP upstreams: - 10.0.0.9 name: example zones: - example.net upstreamResolvers: policy: Sequential transportConfig: {} upstreams: - port: 53 type: SystemResolvConf status: clusterDomain: cluster.local clusterIP: 172.30.0.10 conditions: - lastTransitionTime: "2024-01-19T08:27:21Z" message: Enough DNS pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Degraded - lastTransitionTime: "2024-01-19T08:28:03Z" message: All DNS and node-resolver pods are available, and the DNS service has a cluster IP address. reason: AsExpected status: "False" type: Progressing - lastTransitionTime: "2024-01-19T08:00:02Z" message: The DNS daemonset has available pods, and the DNS service has a cluster IP address. 
reason: AsExpected status: "True" type: Available - lastTransitionTime: "2024-01-11T09:14:04Z" message: DNS Operator can be upgraded reason: AsExpected status: "True" type: Upgradeable $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-frdkm 2/2 Running 0 3m5s 10.131.0.52 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-jsfkb 2/2 Running 0 99s 10.129.0.49 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-jzzqc 2/2 Running 0 2m21s 10.128.2.54 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> dns-default-sgf4h 2/2 Running 0 2m 10.130.1.79 aro-cluster-h78zv-h94mh-master-1 <none> <none> dns-default-t8nn7 2/2 Running 0 2m44s 10.129.2.194 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> dns-default-xmvqg 2/2 Running 0 3m27s 10.128.1.133 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-2nmpx 1/1 Running 0 29m 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-689j7 1/1 Running 0 29m 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> node-resolver-8qhls 1/1 Running 0 29m 10.0.0.7 aro-cluster-h78zv-h94mh-master-2 <none> <none> node-resolver-nv8mq 1/1 Running 0 29m 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-r52v7 1/1 Running 0 29m 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-z8d4n 1/1 Running 0 29m 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> $ oc get pod -n project-100 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-54f4d6844b-lr6z9 1/1 Running 0 18h 10.131.0.40 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc rsh -n project-100 tools-54f4d6844b-lr6z9 sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1032 x1-9-foobar.bla.example.net. 
_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1039 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1043 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1048 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1049 x1-9-foobar.bla.example.net. _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net has SRV record 0 0 1050 x1-9-foobar.bla.example.net. --- Back wth OpenShift Container Platform 4.13.27 but now, forcing TCP on pod level $ oc get deployment tools -n project-100 -o yaml apiVersion: apps/v1 kind: Deployment metadata: annotations: alpha.image.policy.openshift.io/resolve-names: '*' app.openshift.io/route-disabled: "false" deployment.kubernetes.io/revision: "5" image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"tools:latest","namespace":"project-100"},"fieldPath":"spec.template.spec.containers[?(@.name==\"tools\")].image","pause":"false"}]' openshift.io/generated-by: OpenShiftWebConsole creationTimestamp: "2024-01-17T11:22:05Z" generation: 5 labels: app: tools app.kubernetes.io/component: tools app.kubernetes.io/instance: tools app.kubernetes.io/name: tools app.kubernetes.io/part-of: tools app.openshift.io/runtime: other-linux app.openshift.io/runtime-namespace: project-100 name: tools namespace: project-100 resourceVersion: "4232839" uid: a8157243-71e1-4597-9aa5-497afed5f722 spec: progressDeadlineSeconds: 600 replicas: 1 revisionHistoryLimit: 10 selector: matchLabels: app: tools strategy: rollingUpdate: maxSurge: 25% maxUnavailable: 25% type: RollingUpdate template: metadata: annotations: openshift.io/generated-by: OpenShiftWebConsole creationTimestamp: null labels: app: tools deployment: tools spec: containers: - command: - /bin/bash - -c - while true; do sleep 1;done image: 
image-registry.openshift-image-registry.svc:5000/project-100/tools@sha256:fba289d2ff20df2bfe38aa58fa3e491bbecf09e90e96b3c9b8c38f786dc2efb8 imagePullPolicy: Always name: tools ports: - containerPort: 8080 protocol: TCP resources: {} terminationMessagePath: /dev/termination-log terminationMessagePolicy: File dnsConfig: options: - name: use-vc dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: {} terminationGracePeriodSeconds: 30 status: availableReplicas: 1 conditions: - lastTransitionTime: "2024-01-17T11:23:56Z" lastUpdateTime: "2024-01-17T11:23:56Z" message: Deployment has minimum availability. reason: MinimumReplicasAvailable status: "True" type: Available - lastTransitionTime: "2024-01-17T11:22:05Z" lastUpdateTime: "2024-01-19T08:33:28Z" message: ReplicaSet "tools-6749b4cf47" has successfully progressed. reason: NewReplicaSetAvailable status: "True" type: Progressing observedGeneration: 5 readyReplicas: 1 replicas: 1 updatedReplicas: 1 $ oc get pod -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES dns-default-7kfzh 2/2 Running 0 2m25s 10.129.2.196 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> dns-default-g4mtd 2/2 Running 0 2m25s 10.128.2.55 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> dns-default-l4xkg 2/2 Running 0 2m26s 10.129.0.50 aro-cluster-h78zv-h94mh-master-2 <none> <none> dns-default-l7rq8 2/2 Running 0 2m25s 10.128.1.135 aro-cluster-h78zv-h94mh-master-0 <none> <none> dns-default-lt6zx 2/2 Running 0 2m26s 10.131.0.53 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> dns-default-t6bzl 2/2 Running 0 2m25s 10.130.1.82 aro-cluster-h78zv-h94mh-master-1 <none> <none> node-resolver-279mf 1/1 Running 0 2m24s 10.0.2.6 aro-cluster-h78zv-h94mh-worker-eastus2-mlrxh <none> <none> node-resolver-2bzfc 1/1 Running 0 2m24s 10.0.2.4 aro-cluster-h78zv-h94mh-worker-eastus3-jhvff <none> <none> node-resolver-bdz4m 1/1 Running 0 2m24s 10.0.0.7 
aro-cluster-h78zv-h94mh-master-2 <none> <none> node-resolver-jrv2w 1/1 Running 0 2m24s 10.0.0.9 aro-cluster-h78zv-h94mh-master-1 <none> <none> node-resolver-lbfg5 1/1 Running 0 2m23s 10.0.0.10 aro-cluster-h78zv-h94mh-master-0 <none> <none> node-resolver-qnm92 1/1 Running 0 2m24s 10.0.2.5 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc get pod -n project-100 -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES tools-6749b4cf47-gmw9v 1/1 Running 0 50s 10.131.0.54 aro-cluster-h78zv-h94mh-worker-eastus1-99l7n <none> <none> $ oc rsh -n project-100 tools-6749b4cf47-gmw9v sh-4.4$ cat /etc/resolv.conf search project-100.svc.cluster.local svc.cluster.local cluster.local khrmlwa2zp4e1oisi1qjtoxwrc.bx.internal.cloudapp.net nameserver 172.30.0.10 options ndots:5 use-vc sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL) $ oc logs dns-default-lt6zx Defaulted container "dns" out of: dns, kube-rbac-proxy .:5353 hostname.bind.:5353 example.xyz.:5353 [INFO] plugin/reload: Running configuration SHA512 = 79d17b9fc0f61d2c6db13a0f7f3d0a873c4d86ab5cba90c3819a5b57a48fac2ef0fb644b55e959984cd51377bff0db04f399a341a584c466e540a0d7501340f7 CoreDNS-1.10.1 linux/amd64, go1.19.13 X:strictfipsruntime, [INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00024781s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size [INFO] 10.131.0.40:51367 - 22867 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.00096551s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size [INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. 
udp 76 false 512" - - 0 5.000619524s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size [INFO] 10.131.0.54:44935 - 3087 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.000369584s [ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13, 4.14, 4.15
How reproducible:
Always
Steps to Reproduce:
1. Run "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" inside a pod
Actual results:
The dns-default pod reports the error below when the query runs.
[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.001868103s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
[INFO] 10.131.0.40:39333 - 54228 "SRV IN _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. udp 76 false 512" - - 0 5.003223099s
[ERROR] plugin/errors: 2 _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. SRV: dns: overflowing header size
The command "host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net" then fails.
sh-4.4$ host -t srv _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net
Host _example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net not found: 2(SERVFAIL)
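To make the "udp 76 false 512" part of the query log easier to read, here is a minimal sketch that splits one of the [INFO] lines above into named fields. The field names are an assumption based on the default format of the CoreDNS log plugin as I understand it (protocol, request size, DO bit, advertised buffer size, then response code, flags, size and duration); the helper name parse_query_log is made up for illustration.

```python
import re

# Assumed field layout (CoreDNS log plugin default format, not taken from this report):
# remote:port - id "type class name proto size do bufsize" rcode rflags rsize duration
QUERY_LOG = re.compile(
    r'(?P<remote>\S+):(?P<port>\d+) - (?P<id>\d+) '
    r'"(?P<type>\S+) (?P<class>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<rflags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)


def parse_query_log(line):
    """Return the fields of one query log line as a dict, or None if it does not match."""
    match = QUERY_LOG.search(line)
    return match.groupdict() if match else None


# Sample line copied from the report above.
sample = (
    '[INFO] 10.131.0.40:39333 - 54228 "SRV IN '
    '_example._tcp.foo-bar-abc-123-xyz-456-foo-000.abcde.example.net. '
    'udp 76 false 512" - - 0 5.001868103s'
)
fields = parse_query_log(sample)
print(fields["proto"], fields["bufsize"], fields["rcode"], fields["rsize"])
```

Read this way, the client queried over UDP with a 512-byte buffer, and the response code and flags were logged as "-" with a response size of 0 after roughly five seconds.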
Expected results:
No error reported in dns-default pod and query to actually return expected result
Additional info:
I suspect that https://github.com/coredns/coredns/issues/5953 and https://github.com/coredns/coredns/pull/6277 are related. I therefore built CoreDNS from the master branch and published it as quay.io/rhn_support_sreber/coredns:latest. When that image runs in the dns-default pod, the host query resolves again.
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/392
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-27341. The following is the description of the original issue:
—
: [bz-Routing] clusteroperator/ingress should not change |
The test has been failing for over a month in the e2e-metal-ipi-sdn-bm-upgrade jobs.
I think this is because there are only two worker nodes in the BM environment, and some HA services lose redundancy when one of the workers is rebooted.
In the medium term I hope to add another node to each cluster, but in the short term we should skip the test.
The kube-apiserver operator triggers "tls: bad certificate" errors, for example in https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214. I checked its must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/
MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake"
2023-11-27 10:11:50.687 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate"
2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate"
...
MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35"
2023-11-27 10:12:16.382 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
openshift-kube-apiserver-operator kube-apiserver-operator-f78c754f9-rbhw9 1/1 Running 2 5h27m 10.129.0.35 ip-10-0-107-238.ec2.internal
For more information, see the Slack thread: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309
Description of problem:
Using 4.14-ec-4 (which uses cgroups v1 by default), I tried to create a cluster with the following PerformanceProfile (and corresponding MCP) by placing them in the manifests folder:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: clusterbotpp
spec:
  cpu:
    isolated: "1-3"
    reserved: "0"
  realTimeKernel:
    enabled: false
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""
and,
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
The cluster often fails to install because bootkube spends a lot of time retrying on this error:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: [#1717] failed to create some manifests:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: "clusterbotpp_kubeletconfig.yaml": failed to update status for kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 597dfcf3-012d-4730-912a-78efabb920ba, UID in object meta:
As a result, the worker nodes do not become ready in time, and the installer marks the cluster installation as failed. Ironically, even after the installer returns with a failure, I have observed that the cluster sometimes reconciles eventually and the worker nodes get provisioned, if you wait long enough.
I am attaching the installation logs from one such run with this issue.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Often
Steps to Reproduce:
1. Try to install a new cluster by placing a PerformanceProfile in the manifests folder
Actual results:
Cluster installation failed.
Expected results:
Cluster installation should succeed.
Additional info:
Also, I didn't observe this occurring in 4.13.9.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/88
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The number of control plane replicas defined in install-config.yaml (or agent-cluster-install.yaml) should be validated to check that it is set to 3, or to 1 in the case of SNO. If it is set to any other value, the "create image" command should fail.
We recently had a case where the number of replicas was set to 2 and the installation failed. It would be good to catch this misconfiguration prior to the install.
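The requested check can be sketched as follows. This is only an illustration of the rule stated above (3 replicas, or 1 for SNO), not the installer's actual code; the function name validate_control_plane_replicas and the error wording are hypothetical.

```python
def validate_control_plane_replicas(replicas):
    """Return an error message, or None when the replica count is acceptable.

    Rule taken from the description above: 3 replicas, or 1 for SNO.
    """
    allowed = (1, 3)  # 1 = single-node OpenShift (SNO), 3 = regular control plane
    if replicas not in allowed:
        return f"control plane replicas must be 3, or 1 for SNO; got {replicas}"
    return None


# The misconfiguration from the case mentioned above (2 replicas) is rejected
# before any image is created.
for count in (1, 2, 3):
    print(count, validate_control_plane_replicas(count))
```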
Description of problem:
The default channel of 4.15 and 4.16 clusters is stable-4.14.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Install a 4.16 cluster
2. Check default channel
# oc adm upgrade
warning: Cannot display available updates:
  Reason: VersionNotFound
  Message: Unable to retrieve available updates: currently reconciling cluster version 4.16.0-0.nightly-2024-01-03-193825 not found in the "stable-4.14" channel
Cluster version is 4.16.0-0.nightly-2024-01-03-193825
Upgradeable=False
  Reason: MissingUpgradeableAnnotation
  Message: Cluster operator cloud-credential should not be upgraded between minor versions: Upgradeable annotation cloudcredential.openshift.io/upgradeable-to on cloudcredential.operator.openshift.io/cluster object needs updating before upgrade. See Manually Creating IAM documentation for instructions on preparing a cluster for upgrade.
Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.14
Actual results:
Default channel is stable-4.14 in a 4.16 cluster
Expected results:
Default channel should be stable-4.16 in a 4.16 cluster
Additional info:
4.15 cluster has the issue as well.
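The expected behaviour can be expressed as a small check: the default channel should follow the major.minor part of the cluster version. The sketch below is an assumption about how such a check could be written for a test, not how the default channel is actually set; expected_default_channel is a hypothetical helper name.

```python
import re


def expected_default_channel(version):
    """Derive the expected default channel from a release version string."""
    match = re.match(r"(\d+)\.(\d+)\.", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return f"stable-{match.group(1)}.{match.group(2)}"


# Version and channel values taken from the report above.
observed_channel = "stable-4.14"
expected = expected_default_channel("4.16.0-0.nightly-2024-01-03-193825")
print(expected, expected == observed_channel)
```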
Description of problem:
ARO supplies a platform kubeletconfig to enable certain features; currently we use this to enable node sizing or enable autoSizingReserved. Customers want the ability to customize podPidsLimit, and we have directed them to configure a second kubeletconfig.
When these kubeletconfigs are rendered into machineconfigs, the order of their application is nondeterministic: the MCs are suffixed by an increasing serial number based on the order the kubeletconfigs were created. This makes it impossible for the customer to ensure their PIDs limit is applied while still allowing ARO to maintain our platform defaults.
We need a way of supplying platform defaults while still allowing the customer to make supported modifications in a way that does not risk being reverted during upgrades or other maintenance.
This issue has manifested in two different ways:
During an upgrade from 4.11.31 to 4.12.40, a cluster had the order of its kubeletconfig-rendered machine configs reversed. We think that in older versions, the initial kubeletconfig did not get an mc-name-suffix annotation applied, but rendered to "99-worker-generated-kubelet" (no suffix). The customer-provided kubeletconfig rendered to the suffix "-1". During the upgrade, MCO saw this as a new kubeletconfig and assigned it the suffix "-2", effectively reversing their order. See the RCS document https://docs.google.com/document/d/19LuhieQhCGgKclerkeO1UOIdprOx367eCSuinIPaqXA
ARO wants to make updates to the platform defaults. We are changing from a kubeletconfig "aro-limits" to a kubeletconfig "dynamic-node". We want to be able to do this while still keeping it as defaults and if the customer has created their own kubeletconfig, the customer's should still take precedence. What we see is that the creation of a new kubeletconfig regardless of source overrides all other kubeletconfigs, causing the customer to lose their customization.
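The reversal described in the first case can be modelled in a few lines. The sketch assumes that rendered machine configs take effect in sorted name order, with the last one taking precedence; that ordering rule, the mapping of names and the helper winning_kubeletconfig are illustrative assumptions, not MCO code.

```python
def winning_kubeletconfig(rendered):
    """Return the kubeletconfig whose rendered machine config sorts last.

    `rendered` maps machine config name -> kubeletconfig name.
    Assumption: the machine config that sorts last takes precedence.
    """
    last_machine_config = sorted(rendered)[-1]
    return rendered[last_machine_config]


# Before the upgrade: the platform config has no suffix, the customer's has "-1".
before_upgrade = {
    "99-worker-generated-kubelet": "platform-defaults",
    "99-worker-generated-kubelet-1": "customer-pids-limit",
}
# After the upgrade: the platform config was treated as new and given "-2".
after_upgrade = {
    "99-worker-generated-kubelet-2": "platform-defaults",
    "99-worker-generated-kubelet-1": "customer-pids-limit",
}
print(winning_kubeletconfig(before_upgrade))
print(winning_kubeletconfig(after_upgrade))
```

Under that assumption the customer's setting wins before the upgrade and the platform default wins afterwards, which matches the loss of customization reported here.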
Version-Release number of selected component (if applicable):
4.12.40+
ARO's older kubeletconfig "aro-limits":
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    aro.openshift.io/limits: ""
  name: aro-limits
spec:
  kubeletConfig:
    evictionHard:
      imagefs.available: 15%
      memory.available: 500Mi
      nodefs.available: 10%
      nodefs.inodesFree: 5%
    systemReserved:
      memory: 2000Mi
  machineConfigPoolSelector:
    matchLabels:
      aro.openshift.io/limits: ""
ARO's newer kubeletconfig, "dynamic-node"
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: dynamic-node
spec:
  autoSizingReserved: true
  machineConfigPoolSelector:
    matchExpressions:
    - key: machineconfiguration.openshift.io/mco-built-in
      operator: Exists
Customer's desired kubeletconfig:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  labels:
    arogcd.arogproj.io/instance: cluster-config
  name: default-pod-pids-limit
spec:
  kubeletConfig:
    podPidsLimit: 2000000
  machineConfigPoolSelector:
    matchExpressions:
    - key: pools.operator.machineconfiguration.io/worker
      operator: Exists
Description of problem:
Change the UI to a non-en_US locale. Navigate to Builds - BuildConfigs. Click on the kebab menu: 'Start last run' is in English.
Version-Release number of selected component (if applicable):
4.14.0-rc.2
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Content is in English
Expected results:
Content should be localized
Additional info:
Reference screenshot https://drive.google.com/file/d/1XrQwpJxftcsvE8rPGvItTaCZ4Sr1Rj1l/view?usp=sharing
Description of problem:
The MAPI metric mapi_current_pending_csr fires even when there are no pending MAPI CSRs. However, there are non-MAPI CSRs present. The metric may not be appropriately scoped to only its own CSRs.
Version-Release number of selected component (if applicable):
Observed in 4.11.25
How reproducible:
Consistent
Steps to Reproduce:
1. Install a component that uses CSRs (like ACM) but leave the CSRs in a pending state
2. Observe metric firing
Actual results:
Metric is firing
Expected results:
Metric only fires if there are MAPI specific CSRs pending
Additional info:
This impacts SRE alerting
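A scoped count could look like the sketch below. It assumes that the CSRs relevant to MAPI are the node CSRs issued for the two kubelet signers, and that "pending" means neither an Approved nor a Denied condition is present; both are assumptions for illustration, and the non-MAPI signer name is a made-up placeholder.

```python
# Assumption: only CSRs for these Kubernetes kubelet signers are relevant to MAPI.
KUBELET_SIGNERS = {
    "kubernetes.io/kube-apiserver-client-kubelet",
    "kubernetes.io/kubelet-serving",
}


def is_pending(csr):
    """A CSR is treated as pending when it has no Approved or Denied condition."""
    conditions = csr.get("status", {}).get("conditions", [])
    return not any(c.get("type") in ("Approved", "Denied") for c in conditions)


def count_pending(csrs, signers=KUBELET_SIGNERS):
    """Count pending CSRs, ignoring those issued for unrelated signers."""
    return sum(
        1
        for csr in csrs
        if is_pending(csr) and csr.get("spec", {}).get("signerName") in signers
    )


csrs = [
    # Pending CSR from another component (hypothetical signer name).
    {"spec": {"signerName": "example.com/other-component"}, "status": {}},
    # Already approved node CSR.
    {
        "spec": {"signerName": "kubernetes.io/kubelet-serving"},
        "status": {"conditions": [{"type": "Approved"}]},
    },
]
print(count_pending(csrs))
```

With only the unrelated CSR pending, the scoped count stays at zero, which is the behaviour the expected results describe.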
This is a clone of issue OCPBUGS-24834. The following is the description of the original issue:
—
Background:
CCO was made optional in https://issues.redhat.com/browse/OCPEDGE-69. CloudCredential was introduced as a new capability in openshift/api. We need to bump the openshift/api dependency in oc to include the CloudCredential capability so that oc adm release extract works correctly.
Description of problem:
Some relevant CredentialsRequests are not extracted by the following command:
oc adm release extract --credentials-requests --included --install-config=install-config.yaml ...
where install-config.yaml looks like the following:
...
capabilities:
  baselineCapabilitySet: None
  additionalEnabledCapabilities:
  - MachineAPI
  - CloudCredential
platform:
  aws:
...
Logs:
...
I1209 19:57:25.968783 79037 extract.go:418] Found manifest 0000_50_cloud-credential-operator_05-iam-ro-credentialsrequest.yaml
I1209 19:57:25.968902 79037 extract.go:429] Excluding Group: "cloudcredential.openshift.io" Kind: "CredentialsRequest" Namespace: "openshift-cloud-credential-operator" Name: "cloud-credential-operator-iam-ro": unrecognized capability names: CloudCredential
...
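The exclusion in the log above can be illustrated with a simplified model of capability filtering. This is a sketch of the general idea, not the oc implementation: a manifest that names a capability the client does not know is excluded, even if the install-config enables it. The function include_manifest and the capability sets used here are hypothetical.

```python
def include_manifest(manifest_caps, enabled_caps, known_caps):
    """Decide whether a manifest is included; returns (included, reason)."""
    unknown = sorted(set(manifest_caps) - set(known_caps))
    if unknown:
        # A client built against an older API does not recognize newer names.
        return False, "unrecognized capability names: " + ", ".join(unknown)
    disabled = sorted(set(manifest_caps) - set(enabled_caps))
    if disabled:
        return False, "disabled capabilities: " + ", ".join(disabled)
    return True, ""


enabled = {"MachineAPI", "CloudCredential"}   # from the install-config above
old_known = {"MachineAPI"}                    # client without the API bump (assumed)
new_known = {"MachineAPI", "CloudCredential"} # client after the API bump (assumed)

print(include_manifest(["CloudCredential"], enabled, old_known))
print(include_manifest(["CloudCredential"], enabled, new_known))
```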
Please review the following PR: https://github.com/openshift/csi-driver-nfs/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Issue was found when analyzing bug https://issues.redhat.com/browse/OCPBUGS-19817
Version-Release number of selected component (if applicable):
4.15.0-0.ci-2023-09-25-165744
How reproducible:
Every time
Steps to Reproduce:
The cluster is an IPsec cluster with the NS extension and the IPsec service enabled.
1. Enable east-west IPsec and wait for the cluster to settle
2. Disable IPsec and wait for the cluster to settle
You will observe that the IPsec pods are deleted.
Actual results:
no pods
Expected results:
The pods should stay; see https://github.com/openshift/cluster-network-operator/blob/master/pkg/network/ovn_kubernetes.go#L314
// If IPsec is enabled for the first time, we start the daemonset. If it is
// disabled after that, we do not stop the daemonset but only stop IPsec.
//
// TODO: We need to do this as, by default, we maintain IPsec state on the
// node in order to maintain encrypted connectivity in the case of upgrades.
// If we only unrender the IPsec daemonset, we will be unable to cleanup
// the IPsec state on the node and the traffic will continue to be
// encrypted.
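The behaviour described in the quoted comment can be modelled as a small state transition. The sketch below is an illustration only; the type and function names are hypothetical and it is not the cluster-network-operator code.

```python
from dataclasses import dataclass


@dataclass
class IPsecState:
    daemonset_rendered: bool = False  # whether the IPsec daemonset exists
    ipsec_active: bool = False        # whether traffic is being encrypted


def reconcile(state, ipsec_enabled):
    """Compute the next state from the current one and the desired setting."""
    if ipsec_enabled:
        # First enablement starts the daemonset and turns IPsec on.
        return IPsecState(daemonset_rendered=True, ipsec_active=True)
    # Disabling only stops IPsec; a daemonset that was rendered stays
    # so that it can clean up the IPsec state on the node.
    return IPsecState(daemonset_rendered=state.daemonset_rendered, ipsec_active=False)
```

In this model, enabling and then disabling IPsec leaves the daemonset in place, which matches the expected result rather than the observed "no pods".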
Additional info:
The Cloud Credential operator was made optional in OCP 4.15, see https://issues.redhat.com/browse/OCPEDGE-69. The CloudCredential cap was added as a new capability.
However, for OCP 4.15 the disablement of CCO is only supported on BareMetal platforms, see https://issues.redhat.com/browse/OCPEDGE-69?focusedId=23595076&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23595076.
We propose to guard against installations on non-BareMetal platforms without the CloudCredential cap, which could be implemented similar to https://issues.redhat.com/browse/OCPBUGS-15659. 
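A guard of the kind proposed here could look roughly like the following. It is a sketch under stated assumptions: the function name is hypothetical, and "baremetal" is assumed to be the platform identifier as it appears in install-config.yaml.

```python
def validate_cloud_credential_capability(platform, enabled_capabilities):
    """Return a list of validation errors (empty when the config is acceptable).

    Assumption: only the "baremetal" platform may run without the
    CloudCredential capability, as stated for OCP 4.15 above.
    """
    errors = []
    if platform != "baremetal" and "CloudCredential" not in enabled_capabilities:
        errors.append(
            f"platform {platform!r} requires the CloudCredential capability to be enabled"
        )
    return errors
```

For example, an AWS install-config with baselineCapabilitySet None and only MachineAPI enabled would be rejected, while the same capability set on baremetal would pass.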
Please review the following PR: https://github.com/openshift/kubevirt-csi-driver/pull/23
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Issue: Profiles remain degraded [1] even after being applied, due to the error shown in [2]:
[1]
$ oc get profile -A
NAMESPACE                                NAME       TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2    rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   worker0    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker1    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker10   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker11   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker12   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker13   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker14   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker15   rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker2    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker3    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker4    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker5    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker6    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker7    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker8    rdpmc-patch-worker   True      True       5d
openshift-cluster-node-tuning-operator   worker9    rdpmc-patch-worker   True      True       5d
[2]
lastTransitionTime: "2023-12-05T22:43:12Z"
message: TuneD daemon issued one or more sysctl override message(s) during profile application. Use reapply_sysctl=true or remove conflicting sysctl net.core.rps_default_mask
reason: TunedSysctlOverride
status: "True"
Looking at the profiles that use the rdpmc-patch-master tuned:
NAMESPACE                                NAME      TUNED                APPLIED   DEGRADED   AGE
openshift-cluster-node-tuning-operator   master0   rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master1   rdpmc-patch-master   True      True       5d
openshift-cluster-node-tuning-operator   master2   rdpmc-patch-master   True      True       5d
The rdpmc-patch-master tuned is configured as follows:
$ oc get tuned rdpmc-patch-master -n openshift-cluster-node-tuning-operator -oyaml |less
spec:
  profile:
  - data: |
      [main]
      include=performance-patch-master
      [sysfs]
      /sys/devices/cpu/rdpmc = 2
    name: rdpmc-patch-master
  recommend:
The performance-patch-master profile, which the tuned above includes, contains:
spec:
  profile:
  - data: |
      [main]
      summary=Custom tuned profile to adjust performance
      include=openshift-node-performance-master-profile
      [bootloader]
      cmdline_removeKernelArgs=-nohz_full=${isolated_cores}
The setting named in the error comes from openshift-node-performance-master-profile, which the profile above includes:
net.core.rps_default_mask=${not_isolated_cpumask}
A RHEL bug has been raised for this: https://issues.redhat.com/browse/RHEL-18972
Version-Release number of selected component (if applicable):
4.14
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/260
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/etcd/pull/215
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Tracker issue for bootimage bump in 4.15. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-20356.
The name in setup.cfg is incorrectly set to ironic-image;
it should be ironic-agent-image.
Upgrade to golang 1.20 for all assisted-installer components
Please review the following PR: https://github.com/openshift/cluster-api-provider-libvirt/pull/262
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Not an issue, just an upstream sync (or, framed as an issue: multus is not up to date).
Please review the following PR: https://github.com/openshift/configmap-reload/pull/56
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/241
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Same as CNF-9173
Opened as a bug in order to backport for 4.15
This is a clone of issue OCPBUGS-25206. The following is the description of the original issue:
—
We need to reenable the e2e integration tests as soon as the operator is available again.
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1883
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/86
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25881. The following is the description of the original issue:
—
Description of problem:
Copying BZ https://bugzilla.redhat.com/show_bug.cgi?id=2250911 to the OCP side (as the fix is needed in the console). [UI] In the openshift-storage-client namespace, an RBD PVC with 'RWX' access mode and volume mode 'Filesystem' can be created from the Client. However, this is an invalid combination for RBD PVC creation. In the ODF Operator UI on other platforms, the volume mode option is not available when the Ceph RBD storage class and RWX access mode are selected, but it is visible in the client operator view. The attempt creates a PVC that gets stuck in the Pending state.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a Provider/Client setup.
2. From the UI, create a PVC: select storage class ceph-rbd and RWX access mode, then check the volume mode. With this bug, both 'Filesystem' and 'Block' volume modes are visible in the UI. Select volume mode Filesystem and create the PVC.
Actual results:
The PVC is created and stuck in Pending status. The PVC event shows an error like:
Generated from openshift-storage-client.rbd.csi.ceph.com_csi-rbdplugin-provisioner-6d9dcb9fc7-vjj22_2bd4ede5-9418-4c8e-80ae-169b5cb4fa80
12 times in the last 13 minutes
failed to provision volume with StorageClass "ocs-storagecluster-ceph-rbd": rpc error: code = InvalidArgument desc = multi node access modes are only supported on rbd `block` type volumes
Expected results:
The volume mode option should not be visible on the page when RWX access mode and an RBD storage class are selected.
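The rule behind the expected result can be sketched as a function that returns the volume modes a form should offer. This is an illustration in Python, not the console's code: the function name is hypothetical, and matching RBD by the provisioner suffix "rbd.csi.ceph.com" is an assumption based on the provisioner name in the event above.

```python
def allowed_volume_modes(provisioner, access_mode):
    """Return the volume modes that are valid for this combination.

    Assumption: any provisioner ending in "rbd.csi.ceph.com" is Ceph RBD.
    """
    is_rbd = provisioner.endswith("rbd.csi.ceph.com")
    if is_rbd and access_mode == "ReadWriteMany":
        # The provisioner rejects multi-node access on Filesystem volumes,
        # so Block is the only valid choice and no selector is needed.
        return ["Block"]
    return ["Filesystem", "Block"]
```

A form could hide the volume mode selector whenever this returns a single entry and submit that entry implicitly.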
Additional info:
Screenshots are attached to the BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2250911 https://bugzilla.redhat.com/show_bug.cgi?id=2250911#c3
Pre-requisites:
The following AlertmanagerConfig object will trigger a panic of the UWM prometheus operator:
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  labels:
    resource: prometheus
spec:
  route:
    groupBy: ["..."]
    groupWait: 1m
    groupInterval: 1m
    repeatInterval: 12h
    receiver: "default_channel"
    routes:
    - matchers:
      - matchType: =
        name: severity
        value: warning
      receiver: teams
  receivers:
  - name: "default_channel"
  - name: teams
    msteamsConfigs:
    - webhookUrl:
        name: alertmanager-teams
        key: webhook
See https://github.com/prometheus-operator/prometheus-operator/issues/6082
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/121
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Reproducer:
1. On a GCP cluster, create an ingress controller with internal load balancer scope, like this:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: foo
  namespace: openshift-ingress-operator
spec:
  domain: foo.<cluster-domain>
  endpointPublishingStrategy:
    type: LoadBalancerService
    loadBalancer:
      dnsManagementPolicy: Managed
      scope: Internal
2. Wait for load balancer service to complete rollout
$ oc -n openshift-ingress get service router-foo
NAME         TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
router-foo   LoadBalancer   172.30.101.233   10.0.128.5    80:32019/TCP,443:32729/TCP   81s
3. Edit ingress controller to set spec.endpointPublishingStrategy.loadBalancer.scope to External
The load balancer service (router-foo in this case) should get an external IP address, but it currently keeps the 10.x.x.x address that was already assigned.
Description of problem:
Port 22 is added to the worker node security group in the Terraform (TF) install [1]:
resource "aws_security_group_rule" "worker_ingress_ssh" {
  type              = "ingress"
  security_group_id = aws_security_group.worker.id
  description       = local.description
  protocol          = "tcp"
  cidr_blocks       = var.cidr_blocks
  from_port         = 22
  to_port           = 22
}
But it is missing in the SDK install [2].
[1] https://github.com/openshift/installer/blob/master/data/data/aws/cluster/vpc/sg-worker.tf#L39-L48
[2] https://github.com/openshift/installer/pull/7676/files#diff-c89a0152f7d51be6e3830081d1c166d9333628982773c154d8fc9a071c8ff765R272
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-31-180021
How reproducible:
Always
Steps to Reproduce:
1. Create a cluster using the SDK installation method
Actual results:
See description.
Expected results:
Port 22 is added to worker node's security group.
Additional info:
Description of problem:
On a baremetal 4.14.0-rc.0 IPv6 SNO cluster, after logging in to the admin console as an admin user, there is no Observe menu in the left navigation bar, see picture: https://drive.google.com/file/d/13RAXPxtKhAElN9xf8bAmLJa0GI8pP0fH/view?usp=sharing. The monitoring-plugin status is Failed, see: https://drive.google.com/file/d/1YsSaGdLT4bMn-6E-WyFWbOpwvDY4t6na/view?usp=sharing. The error is:
Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/ r: Bad Gateway
The console logs show "9443: connect: connection refused", followed by a panic:
$ oc -n openshift-console logs console-6869f8f4f4-56mbj
...
E0915 12:50:15.498589 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:15 http: panic serving [fd01:0:0:1::2]:39156: runtime error: invalid memory address or nil pointer dereference
goroutine 183760 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0003b5760, 0x2?, {0xc0009bc7d1, 0x11}, {0x3a41fa0, 0xc0002f6c40}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xaa00000000000010?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0002f6c40?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc0001f7500?, {0x3a41fa0?, 0xc0002f6c40?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0x5120938?, {0x3a41fa0?, 0xc0002f6c40?}, 0x7ffb6ea27f18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc001102c00?, {0x3a41fa0?, 0xc0002f6c40?}, 0xc000655a00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0002f6c40}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0002f6c40?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc0008201e0?}, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc0009b4120, {0x3a43e70, 0xc001223500})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
I0915 12:50:24.267777 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
I0915 12:50:24.267813 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
E0915 12:50:30.155515 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:30 http: panic serving [fd01:0:0:1::2]:42990: runtime error: invalid memory address or nil pointer dereference
Connections to port 9443 are refused:
$ oc -n openshift-monitoring get pod -o wide
NAME                                                     READY   STATUS    RESTARTS   AGE     IP                  NODE    NOMINATED NODE   READINESS GATES
alertmanager-main-0                                      6/6     Running   6          3d22h   fd01:0:0:1::564     sno-2   <none>           <none>
cluster-monitoring-operator-6cb777d488-nnpmx             1/1     Running   4          7d16h   fd01:0:0:1::12      sno-2   <none>           <none>
kube-state-metrics-dc5f769bc-p97m7                       3/3     Running   12         7d16h   fd01:0:0:1::3b      sno-2   <none>           <none>
monitoring-plugin-85bfb98485-d4g5x                       1/1     Running   4          7d16h   fd01:0:0:1::55      sno-2   <none>           <none>
node-exporter-ndnnj                                      2/2     Running   8          7d16h   2620:52:0:165::41   sno-2   <none>           <none>
openshift-state-metrics-78df59b4d5-j6r5s                 3/3     Running   12         7d16h   fd01:0:0:1::3a      sno-2   <none>           <none>
prometheus-adapter-6f86f7d8f5-ttflf                      1/1     Running   0          4h23m   fd01:0:0:1::b10c    sno-2   <none>           <none>
prometheus-k8s-0                                         6/6     Running   6          3d22h   fd01:0:0:1::566     sno-2   <none>           <none>
prometheus-operator-7c94855989-csts2                     2/2     Running   8          7d16h   fd01:0:0:1::39      sno-2   <none>           <none>
prometheus-operator-admission-webhook-7bb64b88cd-bvq8m   1/1     Running   4          7d16h   fd01:0:0:1::37      sno-2   <none>           <none>
thanos-querier-5bbb764599-vlztq                          6/6     Running   6          3d22h   fd01:0:0:1::56a     sno-2   <none>           <none>
$ oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::f735   <none>        9443/TCP   7d16h
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
* Trying fd02::f735...
* TCP_NODELAY set
* connect to fd02::f735 port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7
There is no such issue on another 4.14.0-rc.0 IPv4 cluster, but the issue reproduced on another 4.14.0-rc.0 IPv6 cluster.
4.14.0-rc.0 IPv4 cluster:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         20m     Cluster version is 4.14.0-rc.0
$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-85bfb98485-nh428   1/1   Running   0   4m   10.128.0.107   ci-ln-pby4bj2-72292-l5q8v-master-0   <none>   <none>
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
...
{
  "name": "monitoring-plugin",
  "version": "1.0.0",
  "displayName": "OpenShift console monitoring plugin",
  "description": "This plugin adds the monitoring UI to the OpenShift web console",
  "dependencies": {
    "@console/pluginAPI": "*"
  },
  "extensions": [
    {
      "type": "console.page/route",
      "properties": {
        "exact": true,
        "path": "/monitoring",
        "component": {
          "$codeRef": "MonitoringUI"
        }
      }
    },
...
The "9443: Connection refused" issue also occurs on a 4.14.0-rc.0 IPv6 cluster (launched with cluster-bot: launch 4.14.0-rc.0 metal,ipv6) after logging in to the console:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         44m     Cluster version is 4.14.0-rc.0
$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-bd6ffdb5d-b5csk   1/1   Running   0   53m   fd01:0:0:4::b   worker-0.ostest.test.metalkube.org   <none>   <none>
monitoring-plugin-bd6ffdb5d-vhtpf   1/1   Running   0   53m   fd01:0:0:5::9   worker-2.ostest.test.metalkube.org   <none>   <none>
$ oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::402d   <none>        9443/TCP   59m
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
* Trying fd02::402d...
* TCP_NODELAY set
* connect to fd02::402d port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7
$ oc -n openshift-console get pod | grep console
console-5cffbc7964-7ljft   1/1   Running   0   56m
console-5cffbc7964-d864q   1/1   Running   0   56m
$ oc -n openshift-console logs console-5cffbc7964-7ljft
...
E0916 14:34:16.330117 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::402d]:9443: connect: connection refused
2023/09/16 14:34:16 http: panic serving [fd01:0:0:4::2]:37680: runtime error: invalid memory address or nil pointer dereference
goroutine 3985 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0008f6780, 0x2?, {0xc000665211, 0x11}, {0x3a41fa0, 0xc0009221c0}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xfe00000000000010?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0009221c0?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc000d8d600?, {0x3a41fa0?, 0xc0009221c0?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0xc000653830?, {0x3a41fa0?, 0xc0009221c0?}, 0x7f824506bf18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc00007e800?, {0x3a41fa0?, 0xc0009221c0?}, 0xc000b2da00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0009221c0}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0009221c0?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc000db9b00?}, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc000653680, {0x3a43e70, 0xc000676f30})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
Version-Release number of selected component (if applicable):
baremetal 4.14.0-rc.0 ipv6 sno cluster,
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=virt_platform' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "virt_platform",
          "baseboard_manufacturer": "Dell Inc.",
          "baseboard_product_name": "01J4WF",
          "bios_vendor": "Dell Inc.",
          "bios_version": "1.10.2",
          "container": "kube-rbac-proxy",
          "endpoint": "https",
          "instance": "sno-2",
          "job": "node-exporter",
          "namespace": "openshift-monitoring",
          "pod": "node-exporter-ndnnj",
          "prometheus": "openshift-monitoring/k8s",
          "service": "node-exporter",
          "system_manufacturer": "Dell Inc.",
          "system_product_name": "PowerEdge R750",
          "system_version": "Not Specified",
          "type": "none"
        },
        "value": [
          1694785092.664,
          "1"
        ]
      }
    ]
  }
}
How reproducible:
only seen on this cluster
Steps to Reproduce:
1. See the description
Actual results:
no Observe menu on admin console, monitoring-plugin is failed
Expected results:
no error
Description of problem:
The oauthclients degraded condition never gets removed: once it is set due to an issue on a cluster, it won't be unset.
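The usual remedy for a sticky condition is to write it on every sync, setting it to False when the sync succeeds. The sketch below illustrates that pattern with plain dictionaries; the function name is hypothetical, the condition type used in the usage note is illustrative, and this is not the console operator's code.

```python
def sync_degraded_condition(conditions, cond_type, err):
    """Return a new condition list with cond_type reflecting the latest sync.

    err is the error from the most recent sync attempt, or None on success.
    """
    # Drop any previous value of this condition so it cannot linger.
    updated = [c for c in conditions if c["type"] != cond_type]
    if err is not None:
        updated.append({"type": cond_type, "status": "True", "message": str(err)})
    else:
        # Explicitly clear the condition on success.
        updated.append({"type": cond_type, "status": "False", "message": ""})
    return updated
```

Calling it with an error and then with None for a type such as "OAuthClientsDegraded" leaves a single entry with status "False".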
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically, when the AuthStatusHandlerFailedApply condition is set on the console operator status conditions.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a developer trying to release the GitOps Dynamic Plugin, I want to have a flag to toggle the static plugin so that it is possible to backport to the old static plugin.
The reason for this ticket is that OCP will have a release in which the static plugin is left as a fallback.
Slack thread: https://redhat-internal.slack.com/archives/C011BL0FEKZ/p1698853635030619
Related to GITOPS-2369: [DynamicPlugin] Remove static plugin from Console
Set up a flag initialized by the dynamic plugin and disable the static plugin when the flag is set.
Only one of static plugin and dynamic plugin will be displayed in console.
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Description of problem:
ovnkube-node fails to start on a customer cluster (see OHSS-26032). The error message does not state which step of the startup process (or which Service or other object defined on the cluster) fails.
Version-Release number of selected component (if applicable):
How reproducible:
Unknown. After a Force Rebuild of the OVN databases, ovnkube-node doesn't start. The issue seems to be with a headless service that sets internalTrafficPolicy: Local, which isn't allowed according to https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/2086-service-internal-traffic-policy/README.md#proposal
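An error that names the offending object would make this kind of failure much easier to diagnose. The following sketch checks a Service (as a parsed manifest dictionary) for the combination described above and returns a message that includes its namespace and name; the function name and the message wording are hypothetical.

```python
def check_service(svc):
    """Return an error string naming the Service, or None if it looks fine."""
    spec = svc.get("spec", {})
    # A headless Service is one whose clusterIP is the literal string "None".
    headless = spec.get("clusterIP") == "None"
    if headless and spec.get("internalTrafficPolicy") == "Local":
        meta = svc.get("metadata", {})
        return (
            f"service {meta.get('namespace')}/{meta.get('name')}: "
            "headless service sets internalTrafficPolicy: Local"
        )
    return None
```

Running such a check over every Service during startup, and logging the returned string, would point directly at the object to fix.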
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/cluster-kube-apiserver-operator/pull/1392
configured HSTS for the KAS in standalone OpenShift, and we need to follow suit.
Please review the following PR: https://github.com/openshift/cluster-monitoring-operator/pull/2084
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When using a disconnected image registry which is hosted at a subdomain of the cluster domain, then Agent-based Installer fails to install a OKD/FCOS cluster. The rendezvous host starts bootkube.sh but fails because it cannot resolve the registry DNS name:
Oct 25 12:47:03 master-0 bootkube.sh[6462]: error: unable to read image virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:76562238a20f2f4dd45770f00730e20425edd376d30d58d7dafb5d6f02b208c5: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": dial tcp: lookup virthost.ostest.test.metalkube.org: no such host Oct 25 12:47:03 master-0 systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE Oct 25 12:47:03 master-0 systemd[1]: bootkube.service: Failed with result 'exit-code'.
This hit the OpenShift CI jobs 'okd-e2e-agent-compact-ipv4' and 'okd-e2e-agent-sno-ipv6', which are based on openshift-metal3/dev-scripts. An example would be an OCP cluster domain (which contains the cluster name) of `ostest.test.metalkube.org` and a disconnected image registry at `virthost.ostest.test.metalkube.org`.
Other diagnosis from the rendezvous host:
[core@master-0 ~]$ sudo podman pull virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:76562238a20f2f4dd45770f00730e20425edd376d30d58d7dafb5d6f02b208c5
Trying to pull virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:76562238a20f2f4dd45770f00730e20425edd376d30d58d7dafb5d6f02b208c5...
Error: initializing source docker://virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:76562238a20f2f4dd45770f00730e20425edd376d30d58d7dafb5d6f02b208c5: pinging container registry virthost.ostest.test.metalkube.org:5000: Get "https://virthost.ostest.test.metalkube.org:5000/v2/": dial tcp: lookup virthost.ostest.test.metalkube.org: no such host
[core@master-0 ~]$ curl -u ocp-user:ocp-pass https://virthost.ostest.test.metalkube.org:5000/v2/_catalog
curl: (6) Could not resolve host: virthost.ostest.test.metalkube.org
[core@master-0 ~]$ dig +noall +answer virthost.ostest.test.metalkube.org
;; communications error to 127.0.0.1#53: connection refused
;; communications error to 127.0.0.1#53: connection refused
;; communications error to 127.0.0.1#53: connection refused
virthost.ostest.test.metalkube.org. 0 IN A 192.168.111.1
After stopping systemd-resolved:
[core@master-0 ~]$ curl -u ocp-user:ocp-pass https://virthost.ostest.test.metalkube.org:5000/v2/_catalog
{"repositories":["localimages/installer","localimages/local-release-image"]}
Report and diagnosis output above from Andrea Fasano.
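The triggering condition described above (a registry hostname that lives under the cluster domain) can be checked before installation. This is a minimal sketch under stated assumptions: the function name is hypothetical, the registry is given as a plain hostname with an optional port (IPv6 literals are not handled), and it is not part of the Agent-based Installer.

```python
def registry_in_cluster_domain(registry: str, cluster_domain: str) -> bool:
    """Return True when the registry host equals, or is a subdomain of, the cluster domain."""
    # Drop an optional ":port" suffix and an optional trailing dot, then compare case-insensitively.
    host = registry.split(":", 1)[0].rstrip(".").lower()
    domain = cluster_domain.rstrip(".").lower()
    return host == domain or host.endswith("." + domain)


if __name__ == "__main__":
    # Values taken from the report above.
    print(registry_in_cluster_domain(
        "virthost.ostest.test.metalkube.org:5000",
        "ostest.test.metalkube.org",
    ))
```

A result of True flags the setup from this report, where the registry name could not be resolved on the rendezvous host.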
This is a clone of issue OCPBUGS-27892. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Aggregator claims these tests only ran 4 times out of what looks like 10 jobs that ran to normal completion:
[sig-network-edge] Application behind service load balancer with PDB remains available using new connections
[sig-network-edge] Application behind service load balancer with PDB remains available using reused connections
However looking at one of the jobs not in the list of passes, we can see these tests ran:
Why is the aggregator missing this result somehow?
Description of problem:
Since moving to a dynamic plugin, the monitoring UI will not work when running locally unless some extra steps are taken. Bridge must be configured to use this plugin, which needs to be running alongside it. Our readme doesn't include this information or instructions.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Read the readme
Actual results:
The readme does not include instructions for running monitoring locally
Expected results:
The readme includes instructions for running monitoring locally
This is a clone of issue OCPBUGS-18699. The following is the description of the original issue:
—
Description of problem:
The OpenShift console shows "Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available." when editing a ConfigMap that was created from an XML file.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a configmap from a file:
# oc create cm test-cm --from-file=server.xml=server.xml
configmap/test-cm created
2. If we try to edit the configmap in the OCP console, we see the following error:
Info alert:Non-printable file detected. File contains non-printable characters. Preview is not available.
Actual results:
Expected results:
Additional info:
Description of problem:
In https://github.com/openshift/installer/pull/7182 support was added to include AdditionalTrustBundle in the installconfigOverride for assisted-service in order to support Proxy with AdditionalTrustBundle. With the recent change to assisted-service https://github.com/openshift/assisted-service/pull/5357 to add it to the API we can remove setting this in installconfigOverride.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Not issue, just upstream sync (or issue: multus is not up-to-date).
This is a clone of issue OCPBUGS-27247. The following is the description of the original issue:
—
Description of problem:
In a UPI cluster there are no MachineSet or Machine resources. When a user visits the Machines and MachineSets list pages, only the plain text 'Not found' is shown.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-16-113018
How reproducible:
Always
Steps to Reproduce:
1. Set up a UPI cluster
2. Go to the MachineSets and Machines list pages and check the empty state message
Actual results:
2. Only the plain text 'Not found' is shown
Expected results:
2. For other resources we show the richer text 'No <resourcekind> found', so these pages should also show 'No Machines found' and 'No MachineSets found'
Additional info:
Description of problem:
An alert notification receiver created through the web console uses the deprecated match field instead of matchers. When match is renamed to matchers without converting its value, the Alertmanager pods go into CrashLoopBackOff with the error:
~~~
ts=2023-11-14T08:42:39.694Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config_out/alertmanager.env.yaml err="yaml: unmarshal errors:\n line 51: cannot unmarshal !!map into []string"
~~~
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create an alert notification receiver through the web console: Administration-->configuration-->Alertmanager-->create receiver-->add receiver
2. Check the generated YAML, which contains a route section with match rather than matchers.
3. Change match to matchers without converting the values defined under it (such as severity or alertname) to the matchers format.
4. Restart the Alertmanager pods, which then enter the CrashLoopBackOff state.
Actual results:
Alert notification receiver uses match field
Expected results:
Alert notification receiver should use the matchers field
Additional info:
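The unmarshal error above arises because match holds a map of label names to values, while matchers expects a list of strings. The sketch below shows one way to convert between the two shapes; the function names are hypothetical, nested child routes are not handled, and this is not the console's or Alertmanager's own code.

```python
def match_to_matchers(match: dict) -> list:
    """Convert a deprecated `match` map into a list of `matchers` strings."""
    matchers = []
    for label, value in sorted(match.items()):
        # Escape backslashes and double quotes so the value stays a valid quoted string.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        matchers.append(f'{label} = "{escaped}"')
    return matchers


def migrate_route(route: dict) -> dict:
    """Return a copy of one route with `match` replaced by equivalent `matchers`."""
    migrated = {key: value for key, value in route.items() if key != "match"}
    if "match" in route:
        existing = list(route.get("matchers", []))
        migrated["matchers"] = existing + match_to_matchers(route["match"])
    return migrated


if __name__ == "__main__":
    # Hypothetical route as it might look after being created with `match`.
    route = {"receiver": "example", "match": {"severity": "critical", "alertname": "Watchdog"}}
    print(migrate_route(route))
```

Renaming the key alone leaves a map where a list of strings is expected, which matches the "cannot unmarshal !!map into []string" message in the log.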
The OpenShift Data Foundation installation wizard has an option to enter the role ARN details on an AWS STS-enabled OCP cluster. However, this field does not accept any input: as soon as anything is typed, it is auto-populated with [object Object], and after that nothing can be typed or pasted into it.
We tried inspecting the page and editing the element directly. On pressing the install button, it throws the error below:
Converting circular structure to JSON --> starting at object with constructor 'HTMLInputElement' | property '__reactFiber$rrh47yimfa' -> object with constructor 'Lu' — property 'stateNode' closes the circle
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The validation and status shown from "wait-for bootstrap-complete" is sometimes inadequate or difficult to decipher because of the number of lines it prints out. The status and validation information is stored in the assisted-service database. agent-gather should query the database and log out the status/status_info columns for the cluster and hosts into a separate log file. A simple glance at this file would make triaging easier and faster.
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/390
Description of problem:
Add Audit configuration for hypershift Hosted Cluster not working as expected.
Version-Release number of selected component (if applicable):
# oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.13.0-0.nightly-2023-05-04-090524 True False 15m Cluster version is 4.13.0-0.nightly-2023-05-04-090524
How reproducible:
Always
Steps to Reproduce:
1. Get hypershift hosted cluster detail from management cluster.
# hostedcluster=$( oc get -n clusters hostedclusters -o json | jq -r .items[].metadata.name)
2. Apply audit profile for hypershift hosted cluster.
# oc patch HostedCluster $hostedcluster -n clusters -p '{"spec": {"configuration": {"apiServer": {"audit": {"profile": "WriteRequestBodies"}}}}}' --type merge
hostedcluster.hypershift.openshift.io/85ea85757a5a14355124 patched
# oc get HostedCluster $hostedcluster -n clusters -ojson | jq .spec.configuration.apiServer.audit
{ "profile": "WriteRequestBodies" }
3. Check Pod or operator restart to apply configuration changes.
# oc get pods -l app=kube-apiserver -n clusters-${hostedcluster}
NAME READY STATUS RESTARTS AGE
kube-apiserver-7c98b66949-9z6rw 5/5 Running 0 36m
kube-apiserver-7c98b66949-gp5rx 5/5 Running 0 36m
kube-apiserver-7c98b66949-wmk8x 5/5 Running 0 36m
# oc get pods -l app=openshift-apiserver -n clusters-${hostedcluster}
NAME READY STATUS RESTARTS AGE
openshift-apiserver-dc4c84ff4-566z9 3/3 Running 0 29m
openshift-apiserver-dc4c84ff4-99zq9 3/3 Running 0 29m
openshift-apiserver-dc4c84ff4-9xdrz 3/3 Running 0 30m
4. Check generated audit log.
# NOW=$(date -u "+%s"); echo "$NOW"; echo "$NOW" > now
1683711189
# kaspod=$(oc get pods -l app=kube-apiserver -n clusters-${hostedcluster} --no-headers -o=jsonpath={.items[0].metadata.name})
# oc logs $kaspod -c audit-logs -n clusters-${hostedcluster} > kas-audit.log
# cat kas-audit.log | grep -iE '"verb":"(get|list|watch)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0
# cat kas-audit.log | grep -iE '"verb":"(create|delete|patch|update)","user":.*(requestObject|responseObject)' | jq -c 'select (.requestReceivedTimestamp | .[0:19] + "Z" | fromdateiso8601 > '"`cat now`)" | wc -l
0
Neither result should be zero. The backend should apply the configuration, or the pods/operator should restart after the configuration change.
Actual results:
Config changes are not applied in the backend. Neither the operator nor the pods restart.
Expected results:
The configuration should be applied, and the pods and operator should restart after config changes.
Additional info:
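The grep and jq pipeline in step 4 counts audit events for a set of verbs that carry a request or response body and were received after a given time. A structural version of that check can be sketched as follows, assuming one JSON audit event per line with the verb, requestReceivedTimestamp, requestObject and responseObject fields seen in the commands above; the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def count_events_with_bodies(lines, verbs, since_epoch):
    """Count audit events for `verbs` that include a body and were received after `since_epoch`."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON audit events
        if not isinstance(event, dict) or event.get("verb") not in verbs:
            continue
        if "requestObject" not in event and "responseObject" not in event:
            continue
        # Keep only the seconds-resolution prefix, as the jq expression `.[0:19]` does.
        stamp = str(event.get("requestReceivedTimestamp", ""))[:19]
        try:
            received = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)
        except ValueError:
            continue
        if received.timestamp() > since_epoch:
            count += 1
    return count


if __name__ == "__main__":
    with open("kas-audit.log") as handle:  # file name taken from the steps above
        print(count_events_with_bodies(handle, {"create", "delete", "patch", "update"}, 1683711189))
```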
I tried upgrading a 4.14 SNO cluster from one nightly image to another and, while on AWS the upgrade works fine, it fails on GCP.
The Cluster Network Operator successfully upgrades ovn-kubernetes, but is stuck on the cloud network config controller, which is in CrashLoopBackOff state because it receives a wrong IP address from the name server when trying to reach the API server. The node IP is actually 10.0.0.3 and the name server returns 10.0.0.2, which I suspect is the bootstrap node IP, but that's only my guess.
Some relevant logs:
$ oc get co network
network 4.14.0-0.nightly-2023-08-15-200133 True True False 86m Deployment "/openshift-cloud-network-config-controller/cloud-network-config-controller" is not available (awaiting 1 nodes)
$ oc get pods -n openshift-ovn-kubernetes -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ovnkube-control-plane-844c8f76fb-q4tvp 2/2 Running 3 24m 10.0.0.3 ci-ln-rij2p1b-72292-xmzf4-master-0 <none> <none>
ovnkube-node-24kb7 10/10 Running 12 (13m ago) 25m 10.0.0.3 ci-ln-rij2p1b-72292-xmzf4-master-0 <none> <none>
$ oc get pods -n openshift-cloud-network-config-controller -o wide
openshift-cloud-network-config-controller cloud-network-config-controller-d65ccbc5b-dnt69 0/1 CrashLoopBackOff 15 (2m37s ago) 40m 10.128.0.141 ci-ln-rij2p1b-72292-xmzf4-master-0 <none> <none>
$ oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-d65ccbc5b-dnt69
W0816 11:06:00.666825 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
F0816 11:06:30.673952 1 main.go:345] Error building controller runtime client: Get "https://api-int.ci-ln-rij2p1b-72292.gcp-2.ci.openshift.org:6443/api?timeout=32s": dial tcp 10.0.0.2:6443: i/o timeout
I also get 10.0.0.2 if I run a DNS query from the node itself or from a pod:
dig api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org
...
;; ANSWER SECTION:
api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org. 60 IN A 10.0.0.2
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always.
Steps to Reproduce:
1. On clusterbot: launch 4.14 gcp,single-node
2. On a terminal: oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.14.0-0.nightly-2023-08-15-200133 --allow-explicit-upgrade --force
Actual results:
name server returns 10.0.0.2, so CNCC fails to reach the API server
Expected results:
name server should return 10.0.0.3
Must-gather: https://drive.google.com/file/d/1MDbsMgIQz7dE6e76z4ad95dwaxbSNrJM/view?usp=sharing
I'm assigning this bug first to the network edge team for a first pass. Please do reassign it if necessary.
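The comparison made by hand above (what the name server returns for api-int versus the node IP) can be scripted. This is a minimal sketch: the function names are hypothetical, only IPv4 A records are considered, and the resolver is injectable so that the check can be exercised without network access.

```python
import socket
from typing import Callable, List


def resolve_ipv4(hostname: str) -> List[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def api_record_matches(hostname: str, expected_ip: str,
                       resolver: Callable[[str], List[str]] = resolve_ipv4) -> bool:
    """Return True when the expected node IP is among the resolved addresses."""
    return expected_ip in resolver(hostname)


if __name__ == "__main__":
    # Stand-in resolver that reproduces the answer shown in the dig output above.
    fake = lambda name: ["10.0.0.2"]
    print(api_record_matches("api-int.ci-ln-zp7dbyt-72292.gcp-2.ci.openshift.org", "10.0.0.3", fake))
```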
Two payloads in a row, first had more failures, second had less but still broken.
Both exhibit this status on the console operator:
status:
  conditions:
  - lastTransitionTime: "2023-11-17T06:06:57Z"
    message: 'OAuthClientSyncDegraded: the server is currently unable to handle the request (get oauthclients.oauth.openshift.io console)'
    reason: OAuthClientSync_FailedRegister
    status: "True"
    type: Degraded
We are suspicious of this PR, however this change was before the payloads started failing, perhaps the issue only surfaces on upgrades once the change was in an accepted payload: https://github.com/openshift/console-operator/pull/808
There is also a hypershift PR that was only present in the second failed payload, possibly a reaction to the problem that didn't fully fix it. There were fewer failures in the second payload than in the first: https://github.com/openshift/hypershift/pull/3151. If so, this will complicate a revert.
Discussion: https://redhat-internal.slack.com/archives/C01C8502FMM/p1700226091335339
CVO reporting:
Could not update service "openshift-cloud-controller-manager-operator/cloud-controller-manager-operator"
(111 of 613): resource may have been deleted
Reported by hypershift team who were first to notice: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1701279683347589
Description of problem:
When trying to deploy with an Internal publish strategy, DNS will fail because proxy VM cannot launch.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set publishStrategy: Internal 2. Fail 3.
Actual results:
terraform fails
Expected results:
private cluster launches
Additional info:
Description of the problem:
A user with an invalid pull secret cannot correct the issue without deleting the infraenv
How reproducible:
100%
Steps to reproduce:
1. Create a malformed pull secret (like this one)
kind: Secret
apiVersion: v1
metadata:
  name: pullsecret
data:
  '.dockerconfigjson': eyJhdXRocyI6eyJub3RoaW5nLmNvbSI6eyJhdXRoIjoiWTJsaGJ3PT09PSIsImVtYWlsIjoiZmFrZUBjaWFvLmNvbSJ9fX0=
type: 'kubernetes.io/dockerconfigjson'
2. Create an infraenv referencing this secret as the pull secret
3. Correct the pull secret
Actual results:
Infraenv still has error message about a malformed pull secret
Expected results:
Infraenv uses the updated pull secret
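For reference, a pull secret like the one in step 1 can be checked for well-formedness before it is referenced. This is a minimal sketch, assuming the usual dockerconfigjson layout (an auths map whose entries carry a base64 auth token of the form user:password); the function name is hypothetical and this is not the assisted-service validation code.

```python
import base64
import binascii
import json


def pull_secret_errors(dockerconfigjson_b64: str) -> list:
    """Return a list of problems found in a base64-encoded .dockerconfigjson value."""
    try:
        config = json.loads(base64.b64decode(dockerconfigjson_b64, validate=True))
    except (binascii.Error, ValueError) as exc:
        return [f"cannot decode .dockerconfigjson: {exc}"]
    auths = config.get("auths") if isinstance(config, dict) else None
    if not isinstance(auths, dict) or not auths:
        return ["missing or empty 'auths' map"]
    errors = []
    for registry, entry in auths.items():
        token = entry.get("auth", "") if isinstance(entry, dict) else ""
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, ValueError):
            errors.append(f"{registry}: 'auth' is not valid base64")
            continue
        if ":" not in decoded:
            errors.append(f"{registry}: 'auth' does not decode to user:password")
    return errors


# The malformed value from step 1 above.
MALFORMED = ("eyJhdXRocyI6eyJub3RoaW5nLmNvbSI6eyJhdXRoIjoiWTJsaGJ3PT09PSIs"
             "ImVtYWlsIjoiZmFrZUBjaWFvLmNvbSJ9fX0=")

if __name__ == "__main__":
    print(pull_secret_errors(MALFORMED))
```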
This is a clone of issue OCPBUGS-29304. The following is the description of the original issue:
—
Description of problem:
Sometimes the prometheus-operator's informer will be stuck because it receives objects that can't be converted to *v1.PartialObjectMetadata.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Not always
Steps to Reproduce:
1. Unknown 2. 3.
Actual results:
prometheus-operator logs show errors like
2024-02-09T08:29:35.478550608Z level=warn ts=2024-02-09T08:29:35.478491797Z caller=klog.go:108 component=k8s_client_runtime func=Warningf msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
2024-02-09T08:29:35.478592909Z level=error ts=2024-02-09T08:29:35.478541608Z caller=klog.go:116 component=k8s_client_runtime func=ErrorDepth msg="github.com/coreos/prometheus-operator/pkg/informers/informers.go:110: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"https://172.30.0.1:443/api/v1/secrets?resourceVersion=29022\": dial tcp 172.30.0.1:443: connect: connection refused"
Expected results:
No error
Additional info:
The bug was introduced in v0.70.0 by https://github.com/prometheus-operator/prometheus-operator/pull/5993, so it only affects 4.16 and 4.15.
Description of problem:
CAPI E2Es failing to start in some CAPI provider's release branches.
Failing with the following error: `go: errors parsing go.mod:94/tmp/tmp.ssf1LXKrim/go.mod:5: unknown directive: toolchain` https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-api/199/pull-ci-openshift-cluster-api-master-e2e-aws-capi-techpreview/1765512397532958720#1:build-log.txt%3A91-95
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is because the script launching the e2e runs it from the `main` branch of the cluster-capi-operator (which has some backward-incompatible go toolchain changes), rather than from the matching release branch.
Description of problem: Multus currently issues a certificate that is valid for 10 minutes; we need to add configuration for certificates that are valid for 24 hours
Description of problem:
When creating an ingresscontroller with an empty spec (or where spec.domain clashes with an existing IC), the ingresscontroller's status shows Admitted as "False" with reason "Invalid". However, the "route_metrics_controller_routes_per_shard" metric shows the shard in the Observe tab of the web-console. When the invalid ingresscontroller is deleted, the "route_metrics_controller_routes_per_shard" metric does not clear the row corresponding to the deleted invalid IC.
Version-Release number of selected component (if applicable):
4.12.0-ec5
How reproducible:
Always
Steps to Reproduce:
1. Create the invalid IC with the following spec:
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: ic-invalid
  namespace: openshift-ingress-operator
spec: {}
2. Check the status of the IC:
$ oc get ingresscontroller -n openshift-ingress-operator ic-invalid -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"operator.openshift.io/v1","kind":"IngressController","metadata":{"annotations":{},"name":"ic-invalid","namespace":"openshift-ingress-operator"},"spec":{}}
  creationTimestamp: "2022-11-11T12:53:41Z"
  generation: 1
  name: ic-invalid
  namespace: openshift-ingress-operator
  resourceVersion: "97453"
  uid: 96eae28e-bb14-447e-822f-602f3a3bb378
spec:
  httpEmptyRequestsPolicy: Respond
status:
  availableReplicas: 0
  conditions:
  - lastTransitionTime: "2022-11-11T12:53:41Z"
    message: 'conflicts with: default'
    reason: Invalid
    status: "False"
    type: Admitted
  domain: apps.arsen-cluster1.devcluster.openshift.com
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      providerParameters:
        aws:
          classicLoadBalancer:
            connectionIdleTimeout: 0s
          type: Classic
        type: AWS
      scope: External
    type: LoadBalancerService
  observedGeneration: 1
  selector: ""
3. Check the "route_metrics_controller_routes_per_shard" metric on the web-console
4. Delete the IC
5. Check the "route_metrics_controller_routes_per_shard" metric again on the web-console
Actual results:
As shown in the attached screenshot, "route_metrics_controller_routes_per_shard" metric adds one row for the invalid IC. This is not cleared even when the IC is deleted.
Expected results:
The "route_metrics_controller_routes_per_shard" metric should not add metric for invalid ICs. Additionally, when the invalid IC is deleted the metric should clear the corresponding row.
Additional info:
OLM creates certs and secrets for operators that it installs. Those secrets need to have ownership annotations.
Description of the problem:
Fix DNS wildcard domain validation.
The DNS wildcard domain starts with validateNoWildcardDNS, and the domain may have an optional trailing dot.
Currently the code assumes that the trailing dot is mandatory for the domain name.
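The intended behaviour, treating the trailing dot as optional, can be sketched as follows. The helper names and the example domain are hypothetical, and this is an illustration of the idea rather than the assisted-service implementation.

```python
WILDCARD_PROBE_PREFIX = "validateNoWildcardDNS."


def normalize_domain(name: str) -> str:
    """Drop a single optional trailing dot and lower-case the name."""
    if name.endswith("."):
        name = name[:-1]
    return name.lower()


def same_domain(left: str, right: str) -> bool:
    """Compare two domain names, ignoring case and an optional trailing dot."""
    return normalize_domain(left) == normalize_domain(right)


def is_wildcard_probe(name: str, cluster_domain: str) -> bool:
    """Return True when `name` is the wildcard probe record for `cluster_domain`."""
    return same_domain(name, WILDCARD_PROBE_PREFIX + cluster_domain)


if __name__ == "__main__":
    # Both spellings refer to the same probe record.
    print(is_wildcard_probe("validateNoWildcardDNS.test.example.com.", "test.example.com"))
    print(is_wildcard_probe("validateNoWildcardDNS.test.example.com", "test.example.com"))
```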
How reproducible:
Steps to reproduce:
1.
2.
3.
Actual results:
Expected results:
$ oc get mc 01-master-kubelet -o json | jq -r '.spec.config.systemd.units | .[] | select(.name=="kubelet.service") | .contents'
[Unit]
Description=Kubernetes Kubelet
Wants=rpc-statd.service network-online.target
Requires=crio.service kubelet-auto-node-size.service
After=network-online.target crio.service kubelet-auto-node-size.service
After=ostree-finalize-staged.service
[Service]
Type=notify
ExecStartPre=/bin/mkdir --parents /etc/kubernetes/manifests
ExecStartPre=/bin/rm -f /var/lib/kubelet/cpu_manager_state
ExecStartPre=/bin/rm -f /var/lib/kubelet/memory_manager_state
EnvironmentFile=/etc/os-release
EnvironmentFile=-/etc/kubernetes/kubelet-workaround
EnvironmentFile=-/etc/kubernetes/kubelet-env
EnvironmentFile=/etc/node-sizing.env
ExecStart=/usr/local/bin/kubenswrapper \
    /usr/bin/kubelet \
      --config=/etc/kubernetes/kubelet.conf \
      --bootstrap-kubeconfig=/etc/kubernetes/kubeconfig \
      --kubeconfig=/var/lib/kubelet/kubeconfig \
      --container-runtime=remote \
      --container-runtime-endpoint=/var/run/crio/crio.sock \
      --runtime-cgroups=/system.slice/crio.service \
      --node-labels=node-role.kubernetes.io/control-plane,node-role.kubernetes.io/master,node.openshift.io/os_id=${ID} \
      --node-ip=${KUBELET_NODE_IP} \
      --minimum-container-ttl-duration=6m0s \
      --cloud-provider= \
      --volume-plugin-dir=/etc/kubernetes/kubelet-plugins/volume/exec \
      \
      --hostname-override=${KUBELET_NODE_NAME} \
      --provider-id=${KUBELET_PROVIDERID} \
      --register-with-taints=node-role.kubernetes.io/master=:NoSchedule \
      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:4c0a1b82501a416df4b926801bc3aa378d2762d0570a0791c6675db1a3365c62 \
      --system-reserved=cpu=${SYSTEM_RESERVED_CPU},memory=${SYSTEM_RESERVED_MEMORY},ephemeral-storage=${SYSTEM_RESERVED_ES} \
      --v=${KUBELET_LOG_LEVEL}
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
Description of problem: Multus should implement per node certificates via integration in the CNO
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/97
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/123
Description of problem:
Console crashes when clicking the "Sort by" table header on the "Resources" tab of an Operand's instance page.
Version-Release number of selected component (if applicable):
4.13.0-0.ci-2022-11-07-202549
How reproducible:
100% (tested with 3 different Operands from 3 different Operators)
Steps to Reproduce:
1. Go to OperatorHub and install an Operator (e.g. Red Hat Integration - AMQ Streams)
2. After the Operator is installed, create an Operand instance (e.g. Kafka)
3. Wait until the Operand instance is created successfully, then go to the instance's Details page --> Resources tab (e.g. Installed Operators > amqstreams.v2.2.0-2 > Kafka details)
4. Click on any table header to sort the resource table
Actual results:
Console crashed
Expected results:
Resource table sorted accordingly.
Additional info:
I was testing this specifically with "OLM copiedCSVsDisabled" feature; however, I could still reproduce this crash after I set that feature back to `false`. Hence, not sure if it relates to that feature. Did cross-check with 4.12 nightly and can't reproduce this with 4.12 nightly
Bump k8s.io/pod-security-admission to v0.28.3
The OKD build image job in ironic-agent-image is failing with the error message:
Complete!
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 14 100 14 0 0 73 0 --:--:-- --:--:-- --:--:-- 73
File "<stdin>", line 1
404: Not Found
^
SyntaxError: illegal target for annotation
INFO[2024-02-29T08:06:27Z] Ran for 4m3s
ERRO[2024-02-29T08:06:27Z] Some steps failed:
ERRO[2024-02-29T08:06:27Z] * could not run steps: step ironic-agent failed: error occurred handling build ironic-agent-amd64: the build ironic-agent-amd64 failed after 1m57s with reason DockerBuildFailed: Dockerfile build strategy has failed.
INFO[2024-02-29T08:06:27Z] Reporting job state 'failed' with reason 'executing_graph:step_failed:building_project_image'
Description of problem:
A package that we use for Power VS has recently been revealed to be unmaintained. We should remove it in favor of maintained solutions.
Version-Release number of selected component (if applicable):
4.13.0 onward
How reproducible:
It's always used
Steps to Reproduce:
1. Deploy with IPI on Power VS 2. Use bluemix-go 3.
Actual results:
bluemix-go is used
Expected results:
bluemix-go should be avoided
Additional info:
Description of problem:
Standalone OpenShift allows customizing templates for OAuth via the oauth.config.openshift.io/cluster resource. In HyperShift, this is done via the HostedCluster.spec.configuration.oauth field. However, setting a reference to secrets in these fields does not take effect on a HyperShift cluster.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. Create a HostedCluster and specify alternate templates for oauth via the HostedCluster.spec.configuration.oauth field.
2. View the oauth UI by attempting to log in to the OpenShift console.
Actual results:
Different oauth templates do not take effect
Expected results:
Templates affect the look of the oauth login page
Additional info:
Description of problem:
Role assignment for Azure AD Workload Identity performed by ccoctl does not provide an option to scope role assignments to a resource group containing customer vnet in a byo vnet installation workflow. https://docs.openshift.com/container-platform/4.13/installing/installing_azure/installing-azure-vnet.html
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. Create Azure resource group and vnet for OpenShift within that resource group.
2. Create Azure AD Workload Identity infrastructure with ccoctl.
3. Follow steps to configure existing vnet for installation setting networkResourceGroupName within the install config.
4. Attempt cluster installation.
Actual results:
Cluster installation fails.
Expected results:
Cluster installation succeeds.
Additional info:
ccoctl must be extended to accept a parameter specifying the network resource group name and scope relevant component role assignments to the network resource group in addition to the installation resource group.
Description of problem:
In 4.13.z releases, the request-serving label is not present in the ignition-server-proxy deployment. The network policy in place prevents egress from the private router to pods that do not have the label, resulting in the ignition-server endpoint not being available from the outside.
Version-Release number of selected component (if applicable):
4.13.12 OCP, 4.14 HO
How reproducible:
Always
Steps to Reproduce:
1. Install latest HO
2. Create a HostedCluster with version 4.13.12
3. Wait for nodes to join
Actual results:
Nodes never join
Expected results:
Nodes join
Additional info:
Nodes are not joining because of the blocked egress from the router to the ignition-server-proxy
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/29
This is a clone of issue OCPBUGS-28662. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-25766. The following is the description of the original issue:
—
Seen in this 4.15 to 4.16 CI run:
: [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers 0s
{ event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 26 times
event [namespace/openshift-machine-api node/ip-10-0-62-147.us-west-2.compute.internal pod/cluster-baremetal-operator-574577fbcb-z8nd4 hmsg/bf39bb17ae - Back-off restarting failed container cluster-baremetal-operator in pod cluster-baremetal-operator-574577fbcb-z8nd4_openshift-machine-api(441969c1-b430-412c-b67f-4ae2f7797f4f)] happened 51 times}
The operator recovered, and the update completed, but it's still probably worth cleaning up whatever's happening to avoid alarming anyone.
Seems like all recent CI runs that match this string touch 4.15, 4.16, or development branches:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=junit&search=Back-off+restarting+failed+container+cluster-baremetal-operator+in+pod+cluster-baremetal-operator' | grep 'failures match'
pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-upgrade-local-gateway (all) - 11 runs, 36% failed, 25% of failures match = 9% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 15 runs, 20% failed, 33% of failures match = 7% impact
pull-ci-openshift-kubernetes-master-e2e-aws-ovn-downgrade (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.16-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 15 runs, 27% failed, 25% of failures match = 7% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 32 runs, 91% failed, 7% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 40 runs, 25% failed, 20% of failures match = 5% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 3 runs, 33% failed, 100% of failures match = 33% impact
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-ovn-upgrade-out-of-change (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 40 runs, 8% failed, 33% of failures match = 3% impact
pull-ci-openshift-azure-file-csi-driver-operator-main-e2e-azure-ovn-upgrade (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 7 runs, 43% failed, 33% of failures match = 14% impact
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade (all) - 10 runs, 30% failed, 33% of failures match = 10% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-gcp-ovn-arm64 (all) - 6 runs, 33% failed, 50% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 11 runs, 18% failed, 50% of failures match = 9% impact
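The impact figure on each line of the listing above is consistent with multiplying the job's failure rate by the share of failures that match the search. The following is a minimal sketch of that arithmetic, not the search tool's own code: the helper name `impact_percent` is hypothetical, and rounding to the nearest percent is an assumption. The tool presumably works from raw run counts, so a result can differ by a point where the displayed percentages are themselves rounded.

```python
def impact_percent(failed_pct: float, match_pct: float) -> int:
    """Estimate impact as (failure rate) x (share of failures that match), in percent.

    Hypothetical helper; assumes rounding to the nearest whole percent.
    """
    return round(failed_pct * match_pct / 100)


# Input figures taken from the listing above.
print(impact_percent(36, 25))   # ovn-kubernetes local-gateway job
print(impact_percent(91, 7))    # 4.16 azure-sdn-upgrade job
print(impact_percent(100, 50))  # azure-file-csi-driver-operator job
```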
Looks like ~8% impact.
Steps to Reproduce:
1. Run ~20 exposed job types.
2. Check for ": [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers" failures with "Back-off restarting failed container cluster-baremetal-operator" messages.
Actual results:
~8% impact.
Expected results:
~0% impact.
Additional info:
Dropping into Loki for the run I'd picked:
{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade/1737335551998038016"} | unpack | pod="cluster-baremetal-operator-574577fbcb-z8nd4" container="cluster-baremetal-operator" |~ "220 06:0"
includes:
E1220 06:04:18.794548 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:05:40.753364 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:05:40.766200 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:05:40.780426 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:05:40.795555 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
I1220 06:08:21.730591 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"=":8080"
I1220 06:08:21.747466 1 webhook.go:104] WebhookDependenciesReady: everything ready for webhooks
I1220 06:08:21.768138 1 clusteroperator.go:217] "new CO status" reason="WaitingForProvisioningCR" processMessage="" message="Waiting for Provisioning CR on BareMetal Platform"
E1220 06:08:21.781058 1 main.go:131] "unable to create controller" err="unable to put \"baremetal\" ClusterOperator in Available state: Operation cannot be fulfilled on clusteroperators.config.openshift.io \"baremetal\": the object has been modified; please apply your changes to the latest version and try again" controller="Provisioning"
So some kind of ClusterOperator-modification race?
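The "object has been modified" error is the usual optimistic-concurrency conflict: the write carried a stale resource version because another writer updated the object in between. One common remedy is to re-read the object and retry the change. The sketch below illustrates that pattern with an in-memory stand-in; `FakeStore`, `Conflict`, and `set_available` are hypothetical names, and this is not a claim about how cluster-baremetal-operator was actually fixed.

```python
class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for an API server holding one object (hypothetical)."""

    def __init__(self):
        self.obj = {"resourceVersion": 1, "available": False}

    def get(self):
        return dict(self.obj)

    def update(self, new):
        # Reject writes that were computed from an outdated copy.
        if new["resourceVersion"] != self.obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        new = dict(new)
        new["resourceVersion"] += 1
        self.obj = new
        return dict(new)


def set_available(store, retries=5, before_update=None):
    """Re-read and re-apply the change whenever the write hits a conflict."""
    for attempt in range(retries):
        current = store.get()
        current["available"] = True
        if before_update:
            before_update(attempt)  # hook that lets a competing writer sneak in
        try:
            return store.update(current)
        except Conflict:
            continue  # stale copy: loop around and fetch the latest version
    raise Conflict("gave up after %d attempts" % retries)
```

Failing the whole process on the first conflict, as the log suggests, turns a transient race into a container restart; retrying keeps the race invisible to the kubelet.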
This is a clone of issue OCPBUGS-28576. The following is the description of the original issue:
—
Description of problem:
Certificate related objects should be in certificates.hypershift.openshift.io/v1alpha1
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. oc api-resources
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/319
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In RHOCP 4.14, the new Node dashboard feature does not show the expected metric/dashboard data.
[hjaiswal@hjaiswal 4_14]$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-26-232.ap-southeast-1.compute.internal Ready control-plane,master 6h12m v1.27.6+1648878
ip-10-0-42-100.ap-southeast-1.compute.internal Ready control-plane,master 6h12m v1.27.6+1648878
ip-10-0-46-197.ap-southeast-1.compute.internal Ready worker 6h3m v1.27.6+1648878
ip-10-0-66-225.ap-southeast-1.compute.internal NotReady worker 6h3m v1.27.6+1648878
ip-10-0-8-20.ap-southeast-1.compute.internal Ready worker 6h5m v1.27.6+1648878
ip-10-0-80-84.ap-southeast-1.compute.internal Ready control-plane,master 6h12m v1.27.6+1648878
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. Check that all nodes are in the Ready state (cluster version 4.14).
2. SSH or debug into any worker node.
3. Stop the kubelet service.
4. Check that the node went into the NotReady state.
5. Open the OpenShift console, go to Observe --> Dashboard, then select the new "Node cluster" feature.
6. It shows "0" nodes in the NotReady state, but it should display "1" node in the NotReady state.
Actual results:
In "Node cluster", the NotReady node is not counted.
Expected results:
In "Node cluster", the NotReady node count should be 1.
Additional info:
Tested in AWS IPI cluster
Description of problem:
When deploying a cluster on Power VS, you need to wait for a short period after the workspace is created to facilitate the network configuration. This period is ignored by the DHCP service.
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Deploy a cluster on Power VS with an installer provisioned workspace
2. Observe that the terraform logs ignore the waiting period
Actual results:
Expected results:
Additional info:
Description of problem:
We are missing the new logic for handling v6-primary in the vSphere UPI nodeip-configuration service: https://github.com/openshift/machine-config-operator/blob/ea88304dd6de521d55a9d3413a764f618af2425a/templates/common/vsphere/units/nodeip-configuration-vsphere-upi.service.yaml#L40
https://github.com/openshift/machine-config-operator/pull/3670 addresses that, but unfortunately it did not make 4.14, so we will need to backport it.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
With OCPBUGS-11099 our Pipeline Plugin supports the TektonConfig config "embedded-status: minimal" option that will be the default in OpenShift Pipelines 1.11+.
But since this change, the Pipeline pages load the TaskRuns for every Pipeline and PipelineRun row. To decrease the risk of a performance issue, we should make this call only if status.tasks is not defined.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Actual results:
The list page loads a list of TaskRuns for each Pipeline / PipelineRun, even if the PipelineRun already contains the related data (status.tasks).
Expected results:
No unnecessary network calls. When the admin changes the TektonConfig config "embedded-status" option to minimal, the UI should still work and load the TaskRuns as it does today.
Additional info:
None
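The condition described above can be sketched as a small predicate: fetch TaskRuns only when the PipelineRun does not already embed them. This is an illustration only. It assumes the PipelineRun is available as a plain dictionary and uses the `status.tasks` field named in the description; `needs_taskrun_fetch` is a hypothetical helper, not the console plugin's code.

```python
def needs_taskrun_fetch(pipeline_run: dict) -> bool:
    """Return True only when the PipelineRun lacks embedded task data.

    Hypothetical helper; 'status.tasks' is the field named in the bug report.
    """
    status = pipeline_run.get("status") or {}
    return status.get("tasks") is None


runs = [
    {"metadata": {"name": "full"}, "status": {"tasks": {"build": {}}}},
    {"metadata": {"name": "minimal"}, "status": {}},
]
# Only the second run would trigger an extra network call.
to_fetch = [r["metadata"]["name"] for r in runs if needs_taskrun_fetch(r)]
print(to_fetch)
```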
Description of problem:
When navigating to the create Channel page from Add or Topology, the name field contains the default name "channel", but the Create button is still disabled and "Required" is shown under the name field.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-09-26-042251
How reproducible:
Always
Steps to Reproduce:
1. Install the serverless operator
2. Go to the Add page in the developer perspective
3. Click on the channel card
Actual results:
The Create button is disabled and the error "Required" is shown under the name field, even though the name field contains the default name "channel".
Expected results:
The create button should be active
Additional info:
If you switch to the YAML view, the Create button becomes active, and if you switch back to the form view, the Create button stays active.
Description of problem:
When the replica for a nodepool is set to 0, the message for the nodepool is "NotFound". This message should not be displayed if the desired replica is 0.
Version-Release number of selected component (if applicable):
How reproducible:
Create a nodepool and set the replica to 0
Steps to Reproduce:
1. Create a hosted cluster
2. Set the replica for the nodepool to 0
Actual results:
NodePool message is "NotFound"
Expected results:
NodePool message to be empty
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/48
Description of the problem:
Installed an ipv6 disconnected agent-based hosted cluster and added 3 workers to it using the boot-it-yourself flow. When scaling down the nodepool to 2 replicas, the agent that should be unbound is stuck in unbinding-pending-user-action state:
state: unbinding-pending-user-action
stateInfo: Host is waiting to be unbound from the cluster
How reproducible:
100%
Steps to reproduce:
1.
2.
3.
Actual results:
Agent stuck in unbinding-pending-user-action state
Expected results:
Agent reaches known-unbound state
This fix contains the following changes coming from the updated version of Kubernetes, up to v1.28.6:
Changelog:
v1.28.6: https://github.com/kubernetes/kubernetes/blob/release-1.28/CHANGELOG/CHANGELOG-1.28.md#changelog-since-v1285
Description of problem:
During the destroy cluster operation, unexpected results from the IBM Cloud API calls for Disks can result in panics when response data (or responses) are missing, resulting in unexpected failures during destroy.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Unknown, dependent on IBM Cloud API responses
Steps to Reproduce:
1. Successfully create IPI cluster on IBM Cloud
2. Attempt to cleanup (destroy) the cluster
Actual results:
Golang panic attempting to parse a HTTP response that is missing or lacking data.
level=info msg=Deleted instance "ci-op-97fkzvv2-e6ed7-5n5zg-master-0"
E0918 18:03:44.787843 33 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 228 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x6a3d760?, 0x274b5790})
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xfffffffe?})
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x6a3d760, 0x274b5790})
    /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion.func1()
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:84 +0x12a
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).Retry(0xc000791ce0, 0xc000573700)
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:99 +0x73
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).waitForDiskDeletion(0xc000791ce0, {{0xc00160c060, 0x29}, {0xc00160c090, 0x28}, {0xc0016141f4, 0x9}, {0x82b9f0d, 0x4}, {0xc00160c060, ...}})
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:78 +0x14f
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyDisks(0xc000791ce0)
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/disk.go:118 +0x485
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction.func1()
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:201 +0x3f
k8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1({0x7f7801e503c8, 0x18})
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:109 +0x1b
k8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext({0x227a2f78?, 0xc00013c000?}, 0xc000a9b690?)
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:154 +0x57
k8s.io/apimachinery/pkg/util/wait.poll({0x227a2f78, 0xc00013c000}, 0xd0?, 0x146fea5?, 0x7f7801e503c8?)
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:245 +0x38
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfiniteWithContext({0x227a2f78, 0xc00013c000}, 0x4136e7?, 0x28?)
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:229 +0x49
k8s.io/apimachinery/pkg/util/wait.PollImmediateInfinite(0x100000000000000?, 0x806f00?)
    /go/src/github.com/openshift/installer/vendor/k8s.io/apimachinery/pkg/util/wait/poll.go:214 +0x46
github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).executeStageFunction(0xc000791ce0, {{0x82bb9a3?, 0xc000a9b7d0?}, 0xc000111de0?}, 0x840366?, 0xc00054e900?)
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:198 +0x108
created by github.com/openshift/installer/pkg/destroy/ibmcloud.(*ClusterUninstaller).destroyCluster
    /go/src/github.com/openshift/installer/pkg/destroy/ibmcloud/ibmcloud.go:172 +0xa87
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
Expected results:
Destroy IBM Cloud Disks during cluster destroy, or provide a useful error message to follow up on.
Additional info:
The ability to reproduce is relatively low, as it requires the IBM Cloud APIs to return specific data (or a lack thereof); it is currently unknown why the HTTP response and/or data is missing. IBM Cloud already has a PR that attempts to mitigate this issue, as was done for other destroy resource calls. Follow up for additional resources as necessary. https://github.com/openshift/installer/pull/7515
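The general shape of such a mitigation is to check that a response exists before reading fields from it, and to report a useful error otherwise. The sketch below shows that guard in isolation. It is not the installer's code or the content of the linked PR; `Response` and `disk_is_gone` are hypothetical stand-ins, and treating a 404 as "disk deleted" is an assumption.

```python
from typing import Optional


class Response:
    """Hypothetical stand-in for an HTTP response object."""

    def __init__(self, status_code: int):
        self.status_code = status_code


def disk_is_gone(response: Optional[Response], error: Optional[Exception]) -> bool:
    """Decide whether a disk lookup shows that the disk has been deleted."""
    if error is None:
        return False  # lookup succeeded, so the disk still exists
    if response is None:
        # Guard: never dereference a missing response; surface the cause instead.
        raise RuntimeError(f"no response while checking disk: {error}")
    if response.status_code == 404:
        return True  # assumption: "not found" means the deletion finished
    raise RuntimeError(f"unexpected status {response.status_code}: {error}")
```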
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/100
This is a clone of issue OCPBUGS-25943. The following is the description of the original issue:
—
Description of problem:
Add a test case verifying that exceeding the openshift.io/image-tags limit blocks the creation of new image references in the project.
Version-Release number of selected component (if applicable):
4.16
PR: https://github.com/openshift/origin/pull/28464
This is a clone of issue OCPBUGS-24245. The following is the description of the original issue:
—
https://github.com/openshift/csi-operator/blob/master/assets/overlays/aws-ebs/base/csidriver.yaml
This file is missing "seLinuxMount: true", which was merged in https://github.com/bertinatto/aws-ebs-csi-driver-operator-1/blob/0a9642cff6d2a7f9aea940ce89b65fc189cba6b6/assets/csidriver.yaml#L14
This is a clone of issue OCPBUGS-23430. The following is the description of the original issue:
—
Description of problem:
On a hybrid cluster with Windows nodes and coreOS nodes mixed, egressIP cannot be applied to coreOS anymore. QE testing profile: 53_IPI on AWS & OVN & WindowsContainer
Version-Release number of selected component (if applicable):
4.14.3
How reproducible:
Always
Steps to Reproduce:
1. Setup cluster with template aos-4_14/ipi-on-aws/versioned-installer-ovn-winc-ci
2. Label on coreOS node as egress node
% oc describe node ip-10-0-59-132.us-east-2.compute.internal
Name: ip-10-0-59-132.us-east-2.compute.internal
Roles: worker
Labels:
    beta.kubernetes.io/arch=amd64
    beta.kubernetes.io/instance-type=m6i.xlarge
    beta.kubernetes.io/os=linux
    failure-domain.beta.kubernetes.io/region=us-east-2
    failure-domain.beta.kubernetes.io/zone=us-east-2b
    k8s.ovn.org/egress-assignable=
    kubernetes.io/arch=amd64
    kubernetes.io/hostname=ip-10-0-59-132.us-east-2.compute.internal
    kubernetes.io/os=linux
    node-role.kubernetes.io/worker=
    node.kubernetes.io/instance-type=m6i.xlarge
    node.openshift.io/os_id=rhcos
    topology.ebs.csi.aws.com/zone=us-east-2b
    topology.kubernetes.io/region=us-east-2
    topology.kubernetes.io/zone=us-east-2b
Annotations:
    cloud.network.openshift.io/egress-ipconfig: [{"interface":"eni-0c661bbdbb0dde54a","ifaddr":{"ipv4":"10.0.32.0/19"},"capacity":{"ipv4":14,"ipv6":15}}]
    csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0629862832fff4ae3"}
    k8s.ovn.org/host-cidrs: ["10.0.59.132/19"]
    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-ip: 10.129.2.13
    k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac: 0a:58:0a:81:02:0d
    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-59-132.us-east-2.compute.internal","mac-address":"06:06:e2:7b:9c:45","ip-address...
    k8s.ovn.org/network-ids: {"default":"0"}
    k8s.ovn.org/node-chassis-id: fa1ac464-5744-40e9-96ca-6cdc74ffa9be
    k8s.ovn.org/node-gateway-router-lrp-ifaddr: {"ipv4":"100.64.0.7/16"}
    k8s.ovn.org/node-id: 7
    k8s.ovn.org/node-mgmt-port-mac-address: a6:25:4e:55:55:36
    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.59.132/19"}
    k8s.ovn.org/node-subnets: {"default":["10.129.2.0/23"]}
    k8s.ovn.org/node-transit-switch-port-ifaddr: {"ipv4":"100.88.0.7/16"}
    k8s.ovn.org/remote-zone-migrated: ip-10-0-59-132.us-east-2.compute.internal
    k8s.ovn.org/zone-name: ip-10-0-59-132.us-east-2.compute.internal
    machine.openshift.io/machine: openshift-machine-api/wduan-debug-1120-vtxkp-worker-us-east-2b-z6wlc
    machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
    machineconfiguration.openshift.io/currentConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
    machineconfiguration.openshift.io/desiredConfig: rendered-worker-5a29871efb344f7e3a3dc51c42c21113
    machineconfiguration.openshift.io/desiredDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
    machineconfiguration.openshift.io/lastAppliedDrain: uncordon-rendered-worker-5a29871efb344f7e3a3dc51c42c21113
    machineconfiguration.openshift.io/lastSyncedControllerConfigResourceVersion: 22806
    machineconfiguration.openshift.io/reason:
    machineconfiguration.openshift.io/state: Done
    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 20 Nov 2023 09:46:53 +0800
Taints: <none>
Unschedulable: false
Lease:
    HolderIdentity: ip-10-0-59-132.us-east-2.compute.internal
    AcquireTime: <unset>
    RenewTime: Mon, 20 Nov 2023 14:01:05 +0800
Conditions:
    Type Status LastHeartbeatTime LastTransitionTime Reason Message
    ---- ------ ----------------- ------------------ ------ -------
    MemoryPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
    DiskPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
    PIDPressure False Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:46:53 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
    Ready True Mon, 20 Nov 2023 13:57:33 +0800 Mon, 20 Nov 2023 09:47:34 +0800 KubeletReady kubelet is posting ready status
Addresses:
    InternalIP: 10.0.59.132
    InternalDNS: ip-10-0-59-132.us-east-2.compute.internal
    Hostname: ip-10-0-59-132.us-east-2.compute.internal
Capacity:
    cpu: 4
    ephemeral-storage: 125238252Ki
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 16092956Ki
    pods: 250
Allocatable:
    cpu: 3500m
    ephemeral-storage: 114345831029
    hugepages-1Gi: 0
    hugepages-2Mi: 0
    memory: 14941980Ki
    pods: 250
System Info:
    Machine ID: ec21151a2a80230ce1e1926b4f8a902c
    System UUID: ec21151a-2a80-230c-e1e1-926b4f8a902c
    Boot ID: cf4b2e39-05ad-4aea-8e53-be669b212c4f
    Kernel Version: 5.14.0-284.41.1.el9_2.x86_64
    OS Image: Red Hat Enterprise Linux CoreOS 414.92.202311150705-0 (Plow)
    Operating System: linux
    Architecture: amd64
    Container Runtime Version: cri-o://1.27.1-13.1.rhaos4.14.git956c5f7.el9
    Kubelet Version: v1.27.6+b49f9d1
    Kube-Proxy Version: v1.27.6+b49f9d1
ProviderID: aws:///us-east-2b/i-0629862832fff4ae3
Non-terminated Pods: (21 in total)
    Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
    --------- ---- ------------ ---------- --------------- ------------- ---
    openshift-cluster-csi-drivers aws-ebs-csi-driver-node-tlw5h 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 4h14m
    openshift-cluster-node-tuning-operator tuned-4fvgv 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h14m
    openshift-dns dns-default-z89zl 60m (1%) 0 (0%) 110Mi (0%) 0 (0%) 11m
    openshift-dns node-resolver-v9stn 5m (0%) 0 (0%) 21Mi (0%) 0 (0%) 4h14m
    openshift-image-registry image-registry-67b88dc677-76hfn 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h14m
    openshift-image-registry node-ca-hw62n 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m
    openshift-ingress-canary ingress-canary-9r9f8 10m (0%) 0 (0%) 20Mi (0%) 0 (0%) 4h13m
    openshift-ingress router-default-5957f4f4c6-tl9gs 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 4h18m
    openshift-machine-config-operator machine-config-daemon-h7fx4 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 4h14m
    openshift-monitoring alertmanager-main-1 9m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h12m
    openshift-monitoring monitoring-plugin-68995cb674-w2wr9 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 4h13m
    openshift-monitoring node-exporter-kbq8z 9m (0%) 0 (0%) 47Mi (0%) 0 (0%) 4h13m
    openshift-monitoring prometheus-adapter-54fc7b9c87-sg4vt 1m (0%) 0 (0%) 40Mi (0%) 0 (0%) 4h13m
    openshift-monitoring prometheus-k8s-1 75m (2%) 0 (0%) 1104Mi (7%) 0 (0%) 4h12m
    openshift-monitoring prometheus-operator-admission-webhook-84b7fffcdc-x8hsz 5m (0%) 0 (0%) 30Mi (0%) 0 (0%) 4h18m
    openshift-monitoring thanos-querier-59cbd86d58-cjkxt 15m (0%) 0 (0%) 92Mi (0%) 0 (0%) 4h13m
    openshift-multus multus-7gjnt 10m (0%) 0 (0%) 65Mi (0%) 0 (0%) 4h14m
    openshift-multus multus-additional-cni-plugins-gn7x9 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 4h14m
    openshift-multus network-metrics-daemon-88tf6 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 4h14m
    openshift-network-diagnostics network-check-target-kpv5v 10m (0%) 0 (0%) 15Mi (0%) 0 (0%) 4h14m
    openshift-ovn-kubernetes ovnkube-node-74nl9 80m (2%) 0 (0%) 1630Mi (11%) 0 (0%) 3h51m
Allocated resources:
    (Total limits may be over 100 percent, i.e., overcommitted.)
    Resource Requests Limits
    -------- -------- ------
    cpu 619m (17%) 0 (0%)
    memory 4296Mi (29%) 0 (0%)
    ephemeral-storage 0 (0%) 0 (0%)
    hugepages-1Gi 0 (0%) 0 (0%)
    hugepages-2Mi 0 (0%) 0 (0%)
Events: <none>
% oc get node -l k8s.ovn.org/egress-assignable=
NAME STATUS ROLES AGE VERSION
ip-10-0-59-132.us-east-2.compute.internal Ready worker 4h14m v1.27.6+b49f9d1
3. Create egressIP object
Actual results:
% oc get egressip
NAME EGRESSIPS ASSIGNED NODE ASSIGNED EGRESSIPS
egressip-1 10.0.59.101
% oc get cloudprivateipconfig
No resources found
Expected results:
The egressIP should be applied to egress node
Additional info:
Description of problem:
Files such as ovs-if-br-ex.nmconnection.J1K8B2 break ovs-configuration.service. Deleting the file fixes the issue.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Bump k8s.io/client-go to v0.28.3
When using metal-ipi with okd-scos, ironic fails to provision nodes
Please review the following PR: https://github.com/openshift/thanos/pull/117
Description of the problem:
When preparing to skip the reboot, the partition names are generated by appending "4" and "3" to the installation disk name. This is not always correct: for nvme disks we should append "p4" and "p3".
How reproducible:
Always with nvme
Steps to reproduce:
1. Try to install with an nvme installation disk
2.
3.
Actual results:
The reboot is not skipped
Expected results:
The reboot should be skipped
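The naming rule behind this bug can be sketched in a few lines. Linux inserts a "p" between the disk name and the partition number when the disk name ends in a digit, which is the case for NVMe devices such as /dev/nvme0n1. The helper `partition_path` is hypothetical and is not the assisted installer's code; applying the rule to every disk name ending in a digit, rather than to nvme only, is a generalization beyond what the report states.

```python
def partition_path(disk: str, number: int) -> str:
    """Build a partition device path from a disk path and a partition number.

    Disk names that end in a digit (e.g. nvme0n1) need a 'p' separator so the
    partition number is not read as part of the disk name.
    """
    separator = "p" if disk[-1].isdigit() else ""
    return f"{disk}{separator}{number}"


for disk in ("/dev/sda", "/dev/nvme0n1"):
    print(partition_path(disk, 3), partition_path(disk, 4))
```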
With IC, ovnkube-node requires namespaces/status permissions.
After talking to Tim Rozet, it seems that this is not necessary. We previously used that approach because ovnkube-node only listened for local pods, so it needed to learn this information/event from a remote gateway pod. Now that ovnkube-node watches all pods, it can just listen for the remote pod and then sync conntrack.
This is a clone of issue OCPBUGS-26049. The following is the description of the original issue:
—
Description of problem:
Going to the "VolumeSnapshots" tab of a pvc shows the error "Oh no! Something went wrong."
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457
How reproducible:
Always
Steps to Reproduce:
1. Create a pvc in project. Go to the pvc's "VolumeSnapshots" tab.
Actual results:
1. The error "Oh no! Something went wrong." shows up on the page.
Expected results:
1. Should show volumesnapshot related to the pvc without error.
Additional info:
screenshot: https://drive.google.com/file/d/1l0i0DCFh_q9mvFHxnftVJL0AM1LaKFOO/view?usp=sharing
This is a clone of issue OCPBUGS-30162. The following is the description of the original issue:
—
Description of problem:
Introduce --issuer-url flag in oc login.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.16.0-0.ci-2024-02-29-213249 True False 8h Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch fresh HCP cluster.
2. Login to https://entra.microsoft.com. Register application and set properly.
3. Prepare variables.
HC_NAME=hypershift-ci-267920
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
AUDIENCE=7686xxxxxx
ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
CLIENT_ID=7686xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret
4. Configure HC without oauthMetadata.
[xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG
[xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: ''
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"
Wait pods to renew:
[xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
certified-operators-catalog-7ff9cffc8f-z5dlg 1/1 Running 0 5h44m
kube-apiserver-6bd9f7ccbd-kqzm7 5/5 Running 0 17m
kube-apiserver-6bd9f7ccbd-p2fw7 5/5 Running 0 15m
kube-apiserver-6bd9f7ccbd-fmsgl 5/5 Running 0 13m
openshift-apiserver-7ffc9fd764-qgd4z 3/3 Running 0 11m
openshift-apiserver-7ffc9fd764-vh6x9 3/3 Running 0 10m
openshift-apiserver-7ffc9fd764-b7znk 3/3 Running 0 10m
konnectivity-agent-577944765c-qxq75 1/1 Running 0 9m42s
hosted-cluster-config-operator-695c5854c-dlzwh 1/1 Running 0 9m42s
cluster-version-operator-7c99cf68cd-22k84 1/1 Running 0 9m42s
konnectivity-agent-577944765c-kqfpq 1/1 Running 0 9m40s
konnectivity-agent-577944765c-7t5ds 1/1 Running 0 9m37s
5. Check console login and oc login.
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
{
"issuer": "https://:0",
"authorization_endpoint": "https://:0/oauth/authorize",
"token_endpoint": "https://:0/oauth/token",
...
}
Check console login, it succeeds, console upper right shows correctly user name oidc-user-test:xxia@redhat.com.
Check oc login:
$ rm -rf ~/.kube/cache/oc/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
Additional info:
Description of problem:
Link: Worker CloudFormation Template, "Installing a cluster on AWS using CloudFormation templates - Installing on AWS | Installing | OpenShift Container Platform 4.13": https://docs.openshift.com/container-platform/4.13/installing/installing_aws/installing-aws-user-infra.html#installation-cloudformation-worker_installing-aws-user-infra
In the OpenShift documentation, under the manual AWS CloudFormation templates, the CloudFormation template for worker nodes has descriptions for Subnet and WorkerSecurityGroupId that refer to the master nodes. Based on the variable names, the descriptions should refer to worker nodes instead.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
Problem
HyperShift requires the KAS corev1.endpoint port to be exposed on the data plane hosts. This way, when traffic is resolved via the SVC, we capture it on that endpoint port and let haproxy redirect it to the LB that resolves to the KAS.
A while ago we introduced spec.networking.apiServer.port to enable IBM to choose which port would be exposed on the data plane hosts, as a hardcoded one might conflict with their environment requirements.
However, as we evolved the support matrix for our endpoint publishing strategies, we mistakenly used that input as the source for exposing other ports, such as the internal HCP namespace SVC. We also forced overwriting the corev1.endpoint value to avoid a discrepancy with what the KAS pod was generating.
Solutions
Untangle the above by:
https://github.com/openshift/hypershift/pull/2964
https://github.com/openshift/hypershift/pull/3149
https://github.com/openshift/hypershift/pull/3147
https://github.com/openshift/hypershift/pull/3185
https://github.com/openshift/hypershift/pull/3186
Description of problem:
When authenticating with the gcloud CLI, the GCP credentials no longer come from the osServiceAccount.json file. In this case, the installer should only allow installs to proceed in Manual credentials mode.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Remove ~/.gcp/osServiceAccount.json
2. Ensure that the GOOGLE_APPLICATION_CREDENTIALS environment variable is not set.
3. Run gcloud auth application-default login.
4. Run the installer.
Actual results:
Install succeeds
Expected results:
The install should fail, noting that the install mode is not Manual.
Additional info:
Please review the following PR: https://github.com/openshift/router/pull/512
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Our external load balancer (ELB) is at 10.1.235.128, but the machine hosts' default network is in another subnet (192.168.90.0/24 according to the error message). The installation then fails with "platform.baremetal.apiVIPs: Invalid value: "10.1.235.128": IP expected to be in one of the machine networks: 192.168.90.0/24, platform.baremetal.ingressVIPs: Invalid value: "10.1.235.128": IP expected to be in one of the machine networks: 192.168.90.0/24"
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Set up a cluster whose load balancer type is "usermanaged" and whose apiVIP/ingressVIP are in a different subnet from the machine network CIDR.
Actual results:
platform.baremetal.apiVIPs: Invalid value: "10.1.235.128": IP expected to be in one of the machine networks: 192.168.90.0/24, platform.baremetal.ingressVIPs: Invalid value: "10.1.235.128": IP expected to be in one of the machine networks: 192.168.90.0/24
Expected results:
With an external (user-managed) load balancer, the apiVIP/ingressVIP may be in a different subnet from the machine network CIDR.
Additional info:
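The check behind the error above, and the relaxation the report asks for, can be sketched as follows. This is only an illustration under stated assumptions: vip_errors and its user_managed_lb flag are hypothetical names, not the installer's actual validation code, and the message text simply mirrors the error quoted in the report.

```python
import ipaddress


def vip_errors(vips, machine_networks, user_managed_lb=False):
    """Return one error string per VIP that lies outside every machine network.

    When user_managed_lb is True the containment check is skipped, which is
    the behavior the report expects for an external load balancer.
    """
    if user_managed_lb:
        return []
    networks = [ipaddress.ip_network(cidr) for cidr in machine_networks]
    errors = []
    for vip in vips:
        address = ipaddress.ip_address(vip)
        # An address of one IP version is never "in" a network of the other version.
        if not any(address in network for network in networks):
            errors.append(
                f'Invalid value: "{vip}": IP expected to be in one of the '
                f"machine networks: {', '.join(machine_networks)}"
            )
    return errors


if __name__ == "__main__":
    # Values taken from the report.
    print(vip_errors(["10.1.235.128"], ["192.168.90.0/24"]))
    print(vip_errors(["10.1.235.128"], ["192.168.90.0/24"], user_managed_lb=True))
```

With the values from the report, the first call yields one error for 10.1.235.128 and the second yields none.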
This is a clone of issue OCPBUGS-25654. The following is the description of the original issue:
—
Description of problem:
Permission related errors in capi capg and cluster-capi-operator logs
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Install a tech preview cluster with the new PRs [https://issues.redhat.com/browse/OCPCLOUD-1718]
2. Run the ClusterInfrastructure regression suite. Example run: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/ginkgo-test/219040/testReport/
Actual results:
Tests related to CCM and CPMS fail.
Expected results:
Tests pass.
Additional info:
Analysis of the tests is done, and Joel has also helped with new commits to the MAPI PRs to fix MAPI-related issues, but the other repos are still WIP.
Logs:
Cluster CAPI operator errors:
[miyadav@miyadav ~]$ oc logs capi-controller-manager-74d65dd8f4-s5rlh --kubeconfig kk2 | grep -i denied [miyadav@miyadav ~]$ oc logs capi-controller-manager-74d65dd8f4-s5rlh --kubeconfig kk2 | grep -i error [miyadav@miyadav ~]$ oc logs cluster-capi-operator-66b7f99b9d-bbqxz --kubeconfig kk2 | grep -i error E1214 06:19:17.025379 1 kind.go:63] controller-runtime/source/EventHandler "msg"="if kind is a CRD, it should be installed before calling Start" "error"="failed to get restmapping: no matches for kind \"GCPCluster\" in group \"infrastructure.cluster.x-k8s.io\"" "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"GCPCluster"} E1214 06:19:17.025874 1 kind.go:68] controller-runtime/source/EventHandler "msg"="failed to get informer from cache" "error"="failed to get restmapping: failed to find API group \"cluster.x-k8s.io\"" E1214 06:19:17.072299 1 kind.go:63] controller-runtime/source/EventHandler "msg"="if kind is a CRD, it should be installed before calling Start" "error"="failed to get restmapping: no matches for kind \"GCPCluster\" in group \"infrastructure.cluster.x-k8s.io\"" "kind"={"Group":"infrastructure.cluster.x-k8s.io","Kind":"GCPCluster"} E1214 06:19:17.312724 1 kind.go:68] controller-runtime/source/EventHandler "msg"="failed to get informer from cache" "error"="failed to get restmapping: failed to find API group \"cluster.x-k8s.io\"" E1214 06:23:21.928322 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 172.30.0.1:443: connect: connection refused E1214 06:23:43.558393 1 controller.go:324] "msg"="Reconciler error" "error"="error during reconcile: failed to set conditions for CAPI Installer controller: Put \"https://172.30.0.1:443/apis/config.openshift.io/v1/clusteroperators/cluster-api/status\": dial tcp 172.30.0.1:443: connect: connection refused" 
"ClusterOperator"={"name":"cluster-api"} "controller"="clusteroperator" "controllerGroup"="config.openshift.io" "controllerKind"="ClusterOperator" "name"="cluster-api" "namespace"="" "reconcileID"="e36d1c19-dd22-4095-8d6b-50101f2bbefe" E1214 06:23:47.931676 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 172.30.0.1:443: connect: connection refused E1214 06:24:03.625555 1 controller.go:324] "msg"="Reconciler error" "error"="error during reconcile: error applying CAPI provider \"cluster-api\" components: error applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - clusterclasses.cluster.x-k8s.io\" at position 0: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/clusterclasses.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - clusters.cluster.x-k8s.io\" at position 1: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/clusters.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - machines.cluster.x-k8s.io\" at position 2: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/machines.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - machinesets.cluster.x-k8s.io\" at position 3: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/machinesets.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component 
\"apiextensions.k8s.io/v1/CustomResourceDefinition - machinedeployments.cluster.x-k8s.io\" at position 4: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/machinedeployments.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - machinepools.cluster.x-k8s.io\" at position 5: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/machinepools.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - clusterresourcesets.addons.cluster.x-k8s.io\" at position 6: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/clusterresourcesets.addons.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - clusterresourcesetbindings.addons.cluster.x-k8s.io\" at position 7: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/clusterresourcesetbindings.addons.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - machinehealthchecks.cluster.x-k8s.io\" at position 8: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/machinehealthchecks.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - extensionconfigs.runtime.cluster.x-k8s.io\" at position 9: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/extensionconfigs.runtime.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component 
\"apiextensions.k8s.io/v1/CustomResourceDefinition - ipaddresses.ipam.cluster.x-k8s.io\" at position 10: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/ipaddresses.ipam.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"apiextensions.k8s.io/v1/CustomResourceDefinition - ipaddressclaims.ipam.cluster.x-k8s.io\" at position 11: Get \"https://172.30.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/ipaddressclaims.ipam.cluster.x-k8s.io\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"rbac.authorization.k8s.io/v1/ClusterRoleBinding - capi-manager-rolebinding\" at position 12: Get \"https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/capi-manager-rolebinding\": dial tcp 172.30.0.1:443: connect: connection refused\nerror applying CAPI provider component \"rbac.authorization.k8s.io/v1/ClusterRole - capi-manager-role\" at position 13: Get \"https://172.30.0.1:443/apis/rbac.authorization.k8s.io/v1/clusterroles/capi-manager-role\": dial tcp 172.30.0.1:443: connect: connection refused" "ClusterOperator"={"name":"cluster-api"} "controller"="clusteroperator" "controllerGroup"="config.openshift.io" "controllerKind"="ClusterOperator" "name"="cluster-api" "namespace"="" "reconcileID"="973b6337-9db3-4543-aa4f-e417b016e32f" E1214 06:25:58.205862 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 172.30.0.1:443: connect: connection refused E1214 06:29:53.798600 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 
172.30.0.1:443: connect: connection refused E1214 06:33:20.139517 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 172.30.0.1:443: connect: connection refused E1214 06:34:16.142400 1 leaderelection.go:327] error retrieving resource lock openshift-cluster-api/cluster-capi-operator-leader: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-cluster-api/leases/cluster-capi-operator-leader": dial tcp 172.30.0.1:443: i/o timeout E1214 06:45:15.546142 1 kubeconfig.go:81] KubeconfigController "msg"="Error reconciling kubeconfig" "error"="error generating kubeconfig: token can't be empty" "Secret"={"name":"cluster-capi-operator-secret","namespace":"openshift-cluster-api"} "controller"="secret" "controllerGroup"="" "controllerKind"="Secret" "name"="cluster-capi-operator-secret" "namespace"="openshift-cluster-api" "reconcileID"="910273fa-6f22-4326-a330-a235be2c6cc4" E1214 06:45:15.560795 1 controller.go:324] "msg"="Reconciler error" "error"="error generating kubeconfig: token can't be empty" "Secret"={"name":"cluster-capi-operator-secret","namespace":"openshift-cluster-api"} "controller"="secret" "controllerGroup"="" "controllerKind"="Secret" "name"="cluster-capi-operator-secret" "namespace"="openshift-cluster-api" "reconcileID"="910273fa-6f22-4326-a330-a235be2c6cc4" E1214 06:45:15.567938 1 kubeconfig.go:81] KubeconfigController "msg"="Error reconciling kubeconfig" "error"="error generating kubeconfig: token can't be empty" "Secret"={"name":"cluster-capi-operator-secret","namespace":"openshift-cluster-api"} "controller"="secret" "controllerGroup"="" "controllerKind"="Secret" "name"="cluster-capi-operator-secret" "namespace"="openshift-cluster-api" "reconcileID"="d6e13dc5-9b90-42f3-bcbd-c451bf4359a9"
CAPG errors:
[miyadav@miyadav ~]$ oc logs capg-controller-manager-6b54798bb9-x6vxk --kubeconfig kk2 | grep -i denied
E1214 07:26:10.892932 1 reconcile.go:152] "msg"="Error creating an instance" "error"="googleapi: Error 400: SERVICE_ACCOUNT_ACCESS_DENIED - The user does not have access to service account 'miyadav-1412v3-28f9k-w@openshift-qe.iam.gserviceaccount.com'. User: 'miyadav-1412-openshift-c-v5vsh@openshift-qe.iam.gserviceaccount.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account" "GCPMachine"={"name":"gcp-machinetemplate-6pgrk","namespace":"openshift-cluster-api"} "controller"="gcpmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="GCPMachine" "name"="gcp-machinetemplate-6pgrk" "namespace"="openshift-cluster-api" "reconcileID"="1cca1651-62b0-4939-b1fb-f7006dbef4eb" "zone"="us-central1-b"
E1214 07:26:10.892988 1 gcpmachine_controller.go:229] "msg"="Error reconciling instance resources" "error"="googleapi: Error 400: SERVICE_ACCOUNT_ACCESS_DENIED - The user does not have access to service account 'miyadav-1412v3-28f9k-w@openshift-qe.iam.gserviceaccount.com'. User: 'miyadav-1412-openshift-c-v5vsh@openshift-qe.iam.gserviceaccount.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account" "GCPMachine"={"name":"gcp-machinetemplate-6pgrk","namespace":"openshift-cluster-api"} "controller"="gcpmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="GCPMachine" "name"="gcp-machinetemplate-6pgrk" "namespace"="openshift-cluster-api" "reconcileID"="1cca1651-62b0-4939-b1fb-f7006dbef4eb"
E1214 07:26:10.911565 1 controller.go:324] "msg"="Reconciler error" "error"="googleapi: Error 400: SERVICE_ACCOUNT_ACCESS_DENIED - The user does not have access to service account 'miyadav-1412v3-28f9k-w@openshift-qe.iam.gserviceaccount.com'. User: 'miyadav-1412-openshift-c-v5vsh@openshift-qe.iam.gserviceaccount.com'. Ask a project owner to grant you the iam.serviceAccountUser role on the service account" "GCPMachine"={"name":"gcp-machinetemplate-6pgrk","namespace":"openshift-cluster-api"} "controller"="gcpmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="GCPMachine" "name"="gcp-machinetemplate-6pgrk" "namespace"="openshift-cluster-api" "reconcileID"="1cca1651-62b0-4939-b1fb-f7006dbef4eb"
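Because the pasted logs run several entries together, it can help to group them by the source location that emitted them before deciding which ones are permission-related. The sketch below assumes the entries follow the klog-style header visible above (severity letter, MMDD, time, thread id, file:line]); count_error_sites is a hypothetical helper, not part of any OpenShift tool.

```python
import re
from collections import Counter

# Matches headers such as: E1214 06:19:17.025379 1 kind.go:63]
KLOG_ERROR_HEADER = re.compile(
    r"\bE\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ ([\w./-]+):(\d+)\]"
)


def count_error_sites(text: str) -> Counter:
    """Count error entries per emitting file:line, even when entries share a line."""
    return Counter(
        f"{match.group(1)}:{match.group(2)}"
        for match in KLOG_ERROR_HEADER.finditer(text)
    )


if __name__ == "__main__":
    # Shortened entries based on the headers in the logs above.
    sample = (
        'E1214 06:19:17.025379 1 kind.go:63] "msg"="..." '
        'E1214 06:23:21.928322 1 leaderelection.go:327] error retrieving resource lock '
        'E1214 06:23:47.931676 1 leaderelection.go:327] error retrieving resource lock '
        'E1214 07:26:10.911565 1 controller.go:324] "msg"="Reconciler error"'
    )
    for site, count in count_error_sites(sample).most_common():
        print(count, site)
```

Grouping this way separates, for example, the leader election connection failures from the reconciler errors without reading each entry in full.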
Description of problem:
Version-Release number of selected component (if applicable):
4.14
How reproducible:
1. oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
2. curl <svc>:<port>
3. oc scale --replicas=3 deploy/<deploy>
4. oc scale --replicas=0 deploy/<deploy>
5. oc scale --replicas=3 deploy/<deploy>
Actual results:
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
Expected results:
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
Additional info:
See the hostname in the server log output for each command.
$ oc patch svc <svc> --type merge --patch '{"spec":{"sessionAffinity":"ClientIP"}}'
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46914
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46928
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 46944
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40510
Hostname: tcp-668655b888-9mmfc, SourceIP: 10.128.1.205, SourcePort: 40520
$ oc scale --replicas=1 deploy/<deploy>
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47082
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 47088
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54832
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 54848
$ oc scale --replicas=3 deploy/<deploy>
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60144
Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 60150
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 60160
Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 51720
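With sessionAffinity set to ClientIP, every request from one source IP is expected to reach the same backend pod. The following sketch checks that property against server log lines in the format shown above. The helper names are hypothetical, and the parsing assumes exactly the "Hostname: ..., SourceIP: ..., SourcePort: ..." layout from this report.

```python
import re
from collections import defaultdict

LOG_LINE = re.compile(r"Hostname: (\S+), SourceIP: (\S+), SourcePort: (\d+)")


def backends_per_client(log_text: str) -> dict:
    """Map each source IP to the set of backend hostnames that served it."""
    seen = defaultdict(set)
    for hostname, source_ip, _port in LOG_LINE.findall(log_text):
        seen[source_ip].add(hostname)
    return dict(seen)


def affinity_holds(log_text: str) -> bool:
    """True when every source IP was served by exactly one backend."""
    return all(len(hosts) == 1 for hosts in backends_per_client(log_text).values())


if __name__ == "__main__":
    # Lines copied from the output after scaling back up to 3 replicas.
    after_scale_up = """
    Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 54850
    Hostname: tcp-668655b888-dcwlr, SourceIP: 10.128.1.205, SourcePort: 46668
    Hostname: tcp-668655b888-xxcn2, SourceIP: 10.128.1.205, SourcePort: 46682
    """
    print(backends_per_client(after_scale_up))
    print(affinity_holds(after_scale_up))
```

For the output captured after scaling back up, the single client 10.128.1.205 maps to two hostnames, so the check reports that affinity does not hold.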
Description of problem:
oc process command fails while running it with a template file
Version-Release number of selected component (if applicable):
4.12.41
How reproducible:
100%
Steps to Reproduce:
1. Create a new project and a template file
$ oc new-project test
$ oc get template httpd-example -n openshift -o yaml > /tmp/template_http.yaml
2. Run the oc process command as given below
$ oc process -f /tmp/template_http.yaml
error: unable to process template: the namespace of the provided object does not match the namespace sent on the request
3. When we run this command against the template from the other namespace, it runs fine.
$ oc process openshift//httpd-example
4. $ oc version
Client Version: 4.12.41
Kustomize Version: v4.5.7
Server Version: 4.12.42
Kubernetes Version: v1.25.14+bcb9a60
Actual results:
$ oc process -f /tmp/template_http.yaml
error: unable to process template: the namespace of the provided object does not match the namespace sent on the request
Expected results:
The command should display the resources it will create.
Additional info:
This is a clone of issue OCPBUGS-27817. The following is the description of the original issue:
—
Description of problem:
When performing upgrades on ROSA HCP clusters with a large number of worker nodes (> 51), the Kube APIServer pods of the cluster use up memory exceeding the capacity of their nodes, resulting in OOMKills.
Version-Release number of selected component (if applicable):
4.14, 4.15
How reproducible:
always
Steps to Reproduce:
1. Create a ROSA HCP cluster
2. Add 100 workers to the cluster
3. Upgrade the cluster
Actual results:
Kube APIServer pods are OOMKilled
Expected results:
Upgrade completes successfully
Additional info:
Description of problem:
Add the missing Vulnerabilities column and Signed icon to the PAC repository PLR list, matching what the PipelineRuns list page already shows.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Easily
Steps to Reproduce:
1. Deploy in wdc with 4.15
2. Observe that workers don't launch
3. Installer fails
Actual results:
worker nodes will not launch
Expected results:
install completes
Additional info:
Description of problem:
Catalog pods in the HyperShift control plane are in ImagePullBackOff.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster with 4.14 HO + OCP 4.14.0-0.ci-2023-09-07-120503
2. Check the control plane pods; the catalog pods in the control plane namespace are in ImagePullBackOff
Actual results:
jiezhao-mac:hypershift jiezhao$ oc get pods -n clusters-jie-test | grep catalog
catalog-operator-64fd787d9c-98wx5 2/2 Running 0 2m43s
certified-operators-catalog-7766fc5b8-4s66z 0/1 ImagePullBackOff 0 2m43s
community-operators-catalog-847cdbff6-wsf74 0/1 ImagePullBackOff 0 2m43s
redhat-marketplace-catalog-fccc6bbb5-2d5x4 0/1 ImagePullBackOff 0 2m43s
redhat-operators-catalog-86b6f66d5d-mpdsc 0/1 ImagePullBackOff 0 2m43s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 65m default-scheduler Successfully assigned clusters-jie-test/certified-operators-catalog-7766fc5b8-4s66z to ip-10-0-64-135.us-east-2.compute.internal
Normal AddedInterface 65m multus Add eth0 [10.128.2.141/23] from openshift-sdn
Normal Pulling 63m (x4 over 65m) kubelet Pulling image "from:imagestream"
Warning Failed 63m (x4 over 65m) kubelet Failed to pull image "from:imagestream": rpc error: code = Unknown desc = reading manifest imagestream in docker.io/library/from: requested access to the resource is denied
Warning Failed 63m (x4 over 65m) kubelet Error: ErrImagePull
Warning Failed 63m (x6 over 65m) kubelet Error: ImagePullBackOff
Normal BackOff 9s (x280 over 65m) kubelet Back-off pulling image "from:imagestream"
jiezhao-mac:hypershift jiezhao$
Expected results:
catalog pods are running
Additional info:
slack: https://redhat-internal.slack.com/archives/C01C8502FMM/p1694170060144859
Please review the following PR: https://github.com/openshift/installer/pull/7819
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/560
kdump crash logs are not written to the remote SSH target when OVN is configured.
See https://issues.redhat.com/browse/OCPBUGS-28239
Description of problem:
When we deploy a cluster in AWS using this template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-sts-private-s3-custom_endpoints-ci, the master MCP is degraded and reports this error:
- lastTransitionTime: "2023-04-25T07:48:45Z"
  message: 'Node ip-10-0-55-111.us-east-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-8ef3f9cb45adb7bbe5f819eb831ffd7d\" not found", Node ip-10-0-60-138.us-east-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-8ef3f9cb45adb7bbe5f819eb831ffd7d\" not found", Node ip-10-0-69-137.us-east-2.compute.internal is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-8ef3f9cb45adb7bbe5f819eb831ffd7d\" not found"'
  reason: 3 nodes are reporting degraded status on sync
  status: "True"
  type: NodeDegraded
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False False 3h12m Error while reconciling 4.14.0-0.nightly-2023-04-19-125337: the cluster operator machine-config is degraded
How reproducible:
2 out of 2.
Steps to Reproduce:
1. Install OCP using this template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-sts-private-s3-custom_endpoints-ci
We can see examples of this installation here: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/198964/ and here: https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/199028/
The builds have been marked as keep forever, but just in case, the parameters are:
INSTANCE_NAME_PREFIX: Your ID, any short string; just make sure it is unique.
VARIABLES_LOCATION: private-templates/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-sts-private-s3-custom_endpoints-ci
LAUNCHER_VARS: <leave empty>
BUSHSLICER_CONFIG: <leave empty>
Actual results:
The installation failed, reporting a degraded master MCP.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False False 3h12m Error while reconciling 4.14.0-0.nightly-2023-04-19-125337: the cluster operator machine-config is degraded
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master False True True 3 0 0 3 4h21m
worker rendered-worker-166729d2617b1b63cf5d9bb818dd9cf8 True False False 3 3 3 0 4h21m
Expected results:
The installation should finish without problems, and no MCP should be degraded.
Additional info:
Must gather linked in the first comment
Description of problem:
The automated case OCP-42340 failed in a CI job running version 4.14.0-ec.4, and the issue was then reproduced in 4.14.0-0.nightly-2023-08-22-221456.
Version-Release number of selected component (if applicable):
4.14.0-ec.4
4.14.0-0.nightly-2023-08-22-221456
How reproducible:
Always
Steps to Reproduce:
1. Deploy egressrouter on baremetal with
{
  "kind": "List",
  "apiVersion": "v1",
  "metadata": {},
  "items": [
    {
      "apiVersion": "network.operator.openshift.io/v1",
      "kind": "EgressRouter",
      "metadata": {
        "name": "egressrouter-42430",
        "namespace": "e2e-test-networking-egressrouter-l4xgx"
      },
      "spec": {
        "addresses": [
          {
            "gateway": "192.168.111.1",
            "ip": "192.168.111.55/24"
          }
        ],
        "mode": "Redirect",
        "networkInterface": {
          "macvlan": {
            "mode": "Bridge"
          }
        },
        "redirect": {
          "redirectRules": [
            {
              "destinationIP": "142.250.188.206",
              "port": 80,
              "protocol": "TCP"
            },
            {
              "destinationIP": "142.250.188.206",
              "port": 8080,
              "protocol": "TCP",
              "targetPort": 80
            },
            {
              "destinationIP": "142.250.188.206",
              "port": 8888,
              "protocol": "TCP",
              "targetPort": 80
            }
          ]
        }
      }
    }
  ]
}
% oc get pods -n e2e-test-networking-egressrouter-l4xgx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
egress-router-cni-deployment-c4bff88cf-skv9j 1/1 Running 0 69m 10.131.0.26 worker-0 <none> <none>
2. Create a service which points to the egressrouter
% oc get svc -n e2e-test-networking-egressrouter-l4xgx -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    creationTimestamp: "2023-08-23T05:58:30Z"
    name: ovn-egressrouter-multidst-svc
    namespace: e2e-test-networking-egressrouter-l4xgx
    resourceVersion: "50383"
    uid: 07341ff1-6df3-40a6-b27e-59102d56e9c1
  spec:
    clusterIP: 172.30.10.103
    clusterIPs:
    - 172.30.10.103
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: con1
      port: 80
      protocol: TCP
      targetPort: 80
    - name: con2
      port: 5000
      protocol: TCP
      targetPort: 8080
    - name: con3
      port: 6000
      protocol: TCP
      targetPort: 8888
    selector:
      app: egress-router-cni
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
kind: List
metadata:
  resourceVersion: ""
3. Create a test pod to access the service, or curl the egressrouter IP:port directly
oc rsh -n e2e-test-networking-egressrouter-l4xgx hello-pod1
~ $ curl 172.30.10.103:80 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms
~ $ curl 10.131.0.26:80 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms
$ curl 10.131.0.26:8080 --connect-timeout 5
curl: (28) Connection timeout after 5001 ms
Actual results:
The connection fails.
Expected results:
The connection succeeds.
Additional info:
Note, the issue didn't exist in 4.13. It passed in 4.13 latest nightly build 4.13.0-0.nightly-2023-08-11-101506
08-23 15:26:16.955 passed: (1m3s) 2023-08-23T07:26:07 "[sig-networking] SDN ConnectedOnly-Author:huirwang-High-42340-Egress router redirect mode with multiple destinations."
The agent-interactive-console service is required by both sshd and systemd-logind, so if it exits with an error code there is no way to connect or log in to the box to debug.
Description of problem:
When the installer generates a CPMS, it should only add the `failureDomains` field when there is more than one failure domain. When there is only one failure domain, the fields from the failure domain, e.g. the zone, should be injected directly into the provider spec and the failure domain should be omitted. By doing this, we avoid having to care about failure domain injection logic for single zone clusters, potentially avoiding bugs (such as some we have seen recently). IIRC we already did this for OpenStack, but AWS, Azure and GCP may not be affected.
Version-Release number of selected component (if applicable):
How reproducible:
Can be demonstrated on Azure in the westus region, which has no AZs available. Currently the installer creates the following, which we can omit entirely:
```
failureDomains:
  platform: Azure
  azure:
  - zone: ""
```
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
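The rule described above can be sketched as follows. This is an illustration only: the dictionaries are a simplified, hypothetical stand-in for the CPMS template and provider spec (not the real API shape), and build_cpms_template is not an installer function.

```python
import copy


def build_cpms_template(provider_spec: dict, failure_domains: list) -> dict:
    """Build a simplified CPMS-like template.

    - More than one failure domain: keep them under "failureDomains".
    - Exactly one: inject its non-empty fields (e.g. zone) into the provider
      spec and omit "failureDomains".
    - None: leave the provider spec untouched.
    """
    spec = copy.deepcopy(provider_spec)
    template = {"providerSpec": spec}
    if len(failure_domains) > 1:
        template["failureDomains"] = copy.deepcopy(failure_domains)
    elif len(failure_domains) == 1:
        for key, value in failure_domains[0].items():
            if value != "":  # an empty zone, as in westus, carries no information
                spec[key] = value
    return template


if __name__ == "__main__":
    # Region without availability zones: nothing to inject, no failureDomains.
    print(build_cpms_template({"vmSize": "example-size"}, [{"zone": ""}]))
    # Single zone: the zone moves into the provider spec.
    print(build_cpms_template({"vmSize": "example-size"}, [{"zone": "1"}]))
    # Several zones: failureDomains is kept.
    print(build_cpms_template({"vmSize": "example-size"}, [{"zone": "1"}, {"zone": "2"}]))
```

The "vmSize" key and its value are placeholders. The point of the design is that single-zone clusters never exercise the failure domain injection path at all.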
This is a clone of issue OCPBUGS-26977. The following is the description of the original issue:
—
Description of problem:
When using a custom CNI plugin in a hostedcluster, multus requires some CSRs to be approved. The component approving these CSRs is the network-node-identity. This component only gets the proper RBAC rules configured when networkType is set to Calico. In the current implementation, there is a condition that applies the required RBAC only if the networkType is set to Calico [1]. When using other CNI plugins, like Cilium, you're supposed to set networkType to Other. With the current implementation, the required RBAC is then not put in place, so the required CSRs are not approved automatically.
[1] https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/cno/clusternetworkoperator.go#L139
Version-Release number of selected component (if applicable):
Latest
How reproducible:
Always
Steps to Reproduce:
1. Set hostedcluster.spec.networking.networkType to Other
2. Wait for the HC to start deploying and for the Nodes to join the cluster
3. The nodes will remain in NotReady. Multus pods will complain about certificates not being ready.
4. If you list CSRs you will find pending CSRs.
Actual results:
RBAC not properly configured when networkType set to Other
Expected results:
RBAC properly configured when networkType set to Other
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C01C8502FMM/p1704824277049609
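The difference between the actual and expected results can be sketched as two predicates. Both function names are hypothetical, and the second is only one possible shape of a fix, not the actual patch; the real check lives in the Go file linked in the description.

```python
def rbac_applied_current(network_type):
    # Behaviour described in the report: only Calico gets the RBAC rules.
    return network_type == "Calico"


def rbac_applied_expected(network_type):
    # Assumed fix for illustration: also cover the "Other" network type,
    # which is what third-party CNIs such as Cilium are told to use.
    return network_type in ("Calico", "Other")


for network_type in ("Calico", "Other"):
    print(network_type, rbac_applied_current(network_type), rbac_applied_expected(network_type))
```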
A change to the installConfig in 4.12 means a user can now specify both an IPv4 and IPv6 address for the API and/or Ingress VIPs when running dual-stack on the baremetal or vsphere platforms. (Previously, only an IPv4 VIP could be used on dual-stack clusters.)
Once the assisted-service and ZTP support this, we'll want to allow passing that information through.
This needs to wait until 4.12 branches, which should be June 24 per https://lists.corp.redhat.com/archives/aos-hive/2022-April/000006.html
This is a clone of issue OCPBUGS-24408. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When running the 4.15 installer full function test, the following arm64 instance family was detected and verified. It needs to be appended to the installer doc [1]:
- standardBpsv2Family
[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_aarch64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-28607. The following is the description of the original issue:
—
Description of problem:
HyperShift-managed components use the default RevisionHistoryLimit of 10. This significantly impacts etcd load and scalability on the management cluster.
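To get a feel for the scale, here is a rough upper-bound calculation. A Deployment keeps up to `revisionHistoryLimit` old ReplicaSets, so the retained objects grow with the number of hosted clusters. The figure of 40 Deployments per hosted control plane and the lowered limit of 2 are made-up assumptions for illustration only; 375 is the cluster count mentioned in this report.

```python
def max_retained_replicasets(hosted_clusters, deployments_per_cluster, revision_history_limit=10):
    """Upper bound on old ReplicaSets kept in the management cluster's etcd."""
    return hosted_clusters * deployments_per_cluster * revision_history_limit


# Hypothetical: 40 Deployments per hosted control plane.
with_default = max_retained_replicasets(375, 40)
with_lower_limit = max_retained_replicasets(375, 40, revision_history_limit=2)
print(with_default, with_lower_limit)
```

The bound is only reached once every Deployment has rolled out at least as many revisions as the limit allows.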
Version-Release number of selected component (if applicable):
4.9, 4.10, 4.11, 4.12, 4.13, 4.14, 4.15, 4.16
How reproducible:
100% (may vary depending on resource availability on management cluster)
Steps to Reproduce:
1. Create 375+ HostedClusters
2. Observe etcd performance on management cluster
Actual results:
etcd hitting storage space limits
Expected results:
Able to manage HyperShift control planes at scale (375+ HostedClusters)
Additional info:
Description of problem:
The test case https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926 was created for the NE-577 epic. When we increase 'spec.tuningOptions.maxConnections' to 200000, the default ingress controller gets stuck in progressing.
Version-Release number of selected component (if applicable):
How reproducible:
https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-50926
Steps to Reproduce:
1. Edit the default controller with max value 2000000
oc -n openshift-ingress-operator edit ingresscontroller default
tuningOptions:
  maxConnections: 2000000
2. melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml | grep -A1 tuningOptions
tuningOptions:
  maxConnections: 2000000
3. melvinjoseph@mjoseph-mac openshift-tests-private % oc get co/ingress
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
ingress   4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h42m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
Actual results:
The default ingress controller is stuck in progressing
Expected results:
The ingress controller should work as normal
Additional info:
melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
NAME                              READY   STATUS        RESTARTS   AGE
router-default-7cf67f448-gb7mr    0/1     Running       0          38s
router-default-7cf67f448-qmvks    0/1     Running       0          38s
router-default-7dcd556587-kvk8d   0/1     Terminating   0          3h53m
router-default-7dcd556587-vppk4   1/1     Running       0          3h53m

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress get po
NAME                              READY   STATUS    RESTARTS   AGE
router-default-7cf67f448-gb7mr    0/1     Running   0          111s
router-default-7cf67f448-qmvks    0/1     Running   0          111s
router-default-7dcd556587-vppk4   1/1     Running   0          3h55m

melvinjoseph@mjoseph-mac openshift-tests-private % oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h28m
baremetal                                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m
cloud-controller-manager                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h58m
cloud-credential                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h59m
cluster-autoscaler                         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m
config-operator                            4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m
console                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h34m
control-plane-machine-set                  4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m
csi-snapshot-controller                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m
dns                                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m
etcd                                       4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h47m
image-registry                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      176m
ingress                                    4.15.0-0.nightly-2023-10-16-231617   True        True          False      3h39m   ingresscontroller "default" is progressing: IngressControllerProgressing: One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination......
insights                                   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h49m
kube-apiserver                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m
kube-controller-manager                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m
kube-scheduler                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h46m
kube-storage-version-migrator              4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m
machine-api                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h45m
machine-approver                           4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m
machine-config                             4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h53m
marketplace                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h55m
monitoring                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h35m
network                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h57m
node-tuning                                4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m
openshift-apiserver                        4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m
openshift-controller-manager               4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m
openshift-samples                          4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h39m
operator-lifecycle-manager                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m
operator-lifecycle-manager-catalog         4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h54m
operator-lifecycle-manager-packageserver   4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h43m
service-ca                                 4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h56m
storage                                    4.15.0-0.nightly-2023-10-16-231617   True        False         False      3h36m

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get po
NAME                               READY   STATUS    RESTARTS        AGE
ingress-operator-c6fd989fd-jsrzv   2/2     Running   4 (3h45m ago)   3h58m

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator logs ingress-operator-c6fd989fd-jsrzv -c ingress-operator --tail=20
2023-10-17T11:34:54.327Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.348Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.348Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:34:54.394Z INFO operator.ingressclass_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.394Z INFO operator.route_metrics_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.394Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.397Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.429Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.446Z INFO operator.certificate_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.ingressclass_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.route_metrics_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.553Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.557Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "59m59.9999758s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
2023-10-17T11:34:54.558Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.583Z INFO operator.status_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:34:54.657Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "59m59.345629987s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}
2023-10-17T11:34:54.794Z INFO operator.certificate_controller controller/controller.go:118 Reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:36:11.151Z INFO operator.ingress_controller handler/enqueue_mapped.go:81 queueing ingress {"name": "default", "related": ""}
2023-10-17T11:36:11.151Z INFO operator.ingress_controller controller/controller.go:118 reconciling {"request": {"name":"default","namespace":"openshift-ingress-operator"}}
2023-10-17T11:36:11.248Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "58m42.755479533s", "error": "IngressController may become degraded soon: DeploymentReplicasAllAvailable=False"}

melvinjoseph@mjoseph-mac openshift-tests-private % oc get po -n openshift-ingress
NAME                              READY   STATUS    RESTARTS      AGE
router-default-7cf67f448-gb7mr    0/1     Running   1 (71s ago)   3m57s
router-default-7cf67f448-qmvks    0/1     Running   1 (70s ago)   3m57s
router-default-7dcd556587-vppk4   1/1     Running   0             3h57m

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress logs router-default-7cf67f448-gb7mr --tail=20
I1017 11:39:22.623928 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:23.623924 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:24.623373 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:25.627359 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:26.623337 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:27.623603 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:28.623866 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:29.623183 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:30.623475 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:31.623949 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress logs router-default-7cf67f448-qmvks --tail=20
I1017 11:39:34.553475 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:35.551412 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:36.551421 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
E1017 11:39:37.052068 1 haproxy.go:418] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I1017 11:39:37.551648 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:38.551632 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:39.551410 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:40.552620 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:41.552050 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:42.551076 1 healthz.go:261] backend-http check failed: healthz [-]backend-http failed: backend reported failure
I1017 11:39:42.564293 1 template.go:828] router "msg"="Shutdown requested, waiting 45s for new connections to cease"

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller
NAME      AGE
default   3h59m

melvinjoseph@mjoseph-mac openshift-tests-private % oc -n openshift-ingress-operator get ingresscontroller default -o yaml
apiVersion: operator.openshift.io/v1
<-----snip---->
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2023-10-17T07:41:42Z"
    reason: Valid
    status: "True"
    type: Admitted
  - lastTransitionTime: "2023-10-17T07:57:01Z"
    message: The deployment has Available status condition set to True
    reason: DeploymentAvailable
    status: "True"
    type: DeploymentAvailable
  - lastTransitionTime: "2023-10-17T07:57:01Z"
    message: Minimum replicas requirement is met
    reason: DeploymentMinimumReplicasMet
    status: "True"
    type: DeploymentReplicasMinAvailable
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: 1/2 of replicas are available
    reason: DeploymentReplicasNotAvailable
    status: "False"
    type: DeploymentReplicasAllAvailable
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: |
      Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
    reason: DeploymentRollingOut
    status: "True"
    type: DeploymentRollingOut
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: The endpoint publishing strategy supports a managed load balancer
    reason: WantedByEndpointPublishingStrategy
    status: "True"
    type: LoadBalancerManaged
  - lastTransitionTime: "2023-10-17T07:57:24Z"
    message: The LoadBalancer service is provisioned
    reason: LoadBalancerProvisioned
    status: "True"
    type: LoadBalancerReady
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: LoadBalancer is not progressing
    reason: LoadBalancerNotProgressing
    status: "False"
    type: LoadBalancerProgressing
  - lastTransitionTime: "2023-10-17T07:41:43Z"
    message: DNS management is supported and zones are specified in the cluster DNS config.
    reason: Normal
    status: "True"
    type: DNSManaged
  - lastTransitionTime: "2023-10-17T07:57:26Z"
    message: The record is provisioned in all reported zones.
    reason: NoFailedZones
    status: "True"
    type: DNSReady
  - lastTransitionTime: "2023-10-17T07:57:26Z"
    status: "True"
    type: Available
  - lastTransitionTime: "2023-10-17T11:34:54Z"
    message: |-
      One or more status conditions indicate progressing: DeploymentRollingOut=True (DeploymentRollingOut: Waiting for router deployment rollout to finish: 1 old replica(s) are pending termination...
      )
    reason: IngressControllerProgressing
    status: "True"
    type: Progressing
  - lastTransitionTime: "2023-10-17T07:57:28Z"
    status: "False"
    type: Degraded
  - lastTransitionTime: "2023-10-17T07:41:43Z"
<-----snip---->
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/260
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/images/pull/150
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/106
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of the problem:
When trying to create a cluster with the s390x architecture, an error occurs that stops cluster creation. The error is "cannot use Skip MCO reboot because it's not compatible with the s390x architecture on version 4.15.0-ec.3 of OpenShift"
How reproducible:
Always
Steps to reproduce:
Create cluster with architecture s390x
Actual results:
Create failed
Expected results:
Create should succeed
Hit seemingly every job in the last payload:
Credentials request shows:
"conditions": [
  {
    "lastProbeTime": "2023-11-26T11:20:40Z",
    "lastTransitionTime": "2023-11-26T11:20:40Z",
    "message": "failed to grant creds: error syncing creds in mint-mode: error creating custom role: rpc error: code = ResourceExhausted desc = Maximum number of roles reached. Maximum is: 300\nerror details: retry in 24h0m1s",
    "reason": "CredentialsProvisionFailure",
    "status": "True",
    "type": "CredentialsProvisionFailure"
  }
],
We've heard a new gcp account is live, but we're not sure if these are landing in it or not. Perhaps they are and a limit needs to be bumped?
This issue shows up as a Cluster Version Operator component readiness regression due to failing the following tests:
Looking at recent metal-ipi CI jobs
Some of the bootstrap failures seem to be caused by master nodes failing to come up
Search https://search.dptools.openshift.org/?search=Got+0+worker+nodes%2C+%5B12%5D+master+nodes%2C&maxAge=336h&context=-1&type=build-log&name=metal-ipi&excludeName=&maxMatches=1&maxBytes=20971520&groupBy=none
43 results over the last 14 days
level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 1 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
Currently, the Topology Feature is enabled by default by the openstack-cinder-csi-driver-operator. As seen in OCPBUGS-4697, this is problematic in environments where there is a mismatch between Nova and Cinder AZs, such as in DCN environments where there may be multiple nova AZs but only a single cinder AZ. On initial read, the [BlockStorage] ignore-volume-az would appear to offer a way out, but as I noted in OCPBUGS-4697 and upstream, this doesn't actually do what you'd think it does.
We should allow the user to configure this functionality via an operator-level configurable. We may wish to go one step further and also attempt to auto-detect the correct value by inspecting the available Nova and Cinder AZs. The latter step would require OpenStack API access from the operator, but both services do provide non-admin APIs to retrieve this information.
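The auto-detection idea could look roughly like the sketch below. The heuristic (enable topology only when every Nova AZ has a Cinder AZ of the same name) and the function name `should_enable_topology` are assumptions for illustration; the text above does not specify how the comparison would be made, and fetching the AZ lists from the OpenStack APIs is left out.

```python
def should_enable_topology(nova_azs, cinder_azs):
    """Guess whether the CSI topology feature is safe to enable.

    Hypothetical heuristic: a volume can follow its pod's zone only if each
    compute AZ has a volume AZ with the same name.
    """
    nova = set(nova_azs)
    return bool(nova) and nova <= set(cinder_azs)


# DCN-style deployment: several Nova AZs, a single Cinder AZ.
print(should_enable_topology(["az-central", "az-edge1", "az-edge2"], ["nova"]))
# Matching AZ names on both services.
print(should_enable_topology(["az1", "az2"], ["az1", "az2"]))
```

An operator-level setting would still be needed to override the guess, since matching names do not prove that the zones are actually co-located.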
We also explored other options:
All our tunnel traffic, whether GENEVE or VXLAN, should skip conntrack in the host network namespace because it's pointless to track it. It's UDP and it's point-to-point; there are no connections to care about.
We already skip the GENEVE traffic in OVN-K and the VXLAN traffic in SDN, but we aren't skipping the VXLAN traffic that Hybrid Overlay and ICNIv1 generate.
CNO's ovnkube-node YAML should add a couple of lines that, if Hybrid Overlay is enabled, apply -j NOTRACK to traffic on .OVNHybridOverlayVXLANPort. Note that .OVNHybridOverlayVXLANPort will be empty if the default VXLAN port is used, so we'd need a bit of if/else logic to -j NOTRACK the default port if .OVNHybridOverlayVXLANPort is empty.
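The if/else described above can be sketched like this. The real change would live in the ovnkube-node YAML template rather than in Python; the function name, the choice of the PREROUTING and OUTPUT chains of the raw table, and the use of 4789 (the IANA-assigned VXLAN port) as the default are assumptions made for the example.

```python
DEFAULT_VXLAN_PORT = 4789  # IANA VXLAN port; assumed to be the Hybrid Overlay default


def notrack_rules(hybrid_overlay_enabled, configured_port=""):
    """Build iptables commands that exempt Hybrid Overlay VXLAN from conntrack."""
    if not hybrid_overlay_enabled:
        return []
    # An empty configured port means the default VXLAN port is in use.
    port = configured_port if configured_port else DEFAULT_VXLAN_PORT
    return [
        f"iptables -t raw -A {chain} -p udp --dport {port} -j NOTRACK"
        for chain in ("PREROUTING", "OUTPUT")
    ]


for rule in notrack_rules(True):
    print(rule)
for rule in notrack_rules(True, "9789"):
    print(rule)
```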
Please review the following PR: https://github.com/openshift/coredns/pull/106
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
alertmanager-trusted-ca-bundle, prometheus-trusted-ca-bundle, telemeter-trusted-ca-bundle, thanos-querier-trusted-ca-bundle are empty on the hosted cluster. This results in CMO not creating the prometheus CR, resulting in no prometheus pods. This issue prevents us from monitoring the hosted cluster.
Version-Release number of selected component (if applicable):
4.13.z
How reproducible:
Rare: Found only one occurrence for now.
Steps to Reproduce:
1. 2. 3.
Actual results:
Certs are not created, prometheus doesn't create prometheus pods
Expected results:
Certs are created and CMO can create prometheus pods
Additional info:
Linked Must Gather of the MC, inspect of the openshift-monitoring DP namespace
In order to avoid possible issues with SDN during migration from SDN to OVNK, do not use port 9106 for ovnkube-control-plane metrics, since it's already used by SDN. Use a port that is not used by SDN, such as 9108.
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/265
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-24299. The following is the description of the original issue:
—
The CNO-managed component (network-node-identity) should conform to the HyperShift control plane expectation that all secrets are mounted without global read access. Change the mode from 420 (0644) to 416 (0640).
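The decimal values are how the octal file modes appear once serialized. A quick check of the conversion, and of which mode grants read access to "others":

```python
import stat

OLD_MODE = 420  # decimal form of 0o644
NEW_MODE = 416  # decimal form of 0o640


def world_readable(mode):
    """True if the 'others' read bit is set."""
    return bool(mode & stat.S_IROTH)


print(oct(OLD_MODE), world_readable(OLD_MODE))
print(oct(NEW_MODE), world_readable(NEW_MODE))
```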
Some operators failed to install
Multicluster engine (MCE) failed to install. Due to this, the cluster will be degraded, but you can try to install the operator from the Operator Hub. Please check the installation log for more information.
OpenShift version 4.14.0-rc.4
While installed successfully on OpenShift version 4.13.13
Steps to reproduce:
1. Create cluster on AI SaaS version OCP 4.14.0-rc.4
2. Select MCE operator
3. Continue settings and start installation
Actual results:
Cluster installed but
Operators
Multicluster engine failed
Expected results:
Operators
Multicluster engine installed
Please review the following PR: https://github.com/openshift/builder/pull/357
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-29453. The following is the description of the original issue:
—
Description of problem:
The etcd team has introduced an e2e test that exercises a full etcd backup and restore cycle in OCP [1].
We run those tests as part of our PR builds and since 4.15 [2] (also 4.16 [3]), we have failed runs with the catalogd-controller-manager crash looping:
1 events happened too frequently event [namespace/openshift-catalogd node/ip-10-0-25-29.us-west-2.compute.internal pod/catalogd-controller-manager-768bb57cdb-nwbhr hmsg/47b381d71b - Back-off restarting failed container manager in pod catalogd-controller-manager-768bb57cdb-nwbhr_openshift-catalogd(aa38d084-ecb7-4588-bd75-f95adb4f5636)] happened 44 times}
I assume something in that controller doesn't really deal gracefully with the restoration process of etcd, or the apiserver being down for some time.
[1] https://github.com/openshift/origin/blob/master/test/extended/dr/recovery.go#L97
[2] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1205/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-etcd-recovery/1757443629380538368
[3] https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1191/pull-ci-openshift-cluster-etcd-operator-release-4.15-e2e-aws-etcd-recovery/1752293248543494144
Version-Release number of selected component (if applicable):
> 4.15
How reproducible:
always by running the test
Steps to Reproduce:
Run the test: [sig-etcd][Feature:DisasterRecovery][Suite:openshift/etcd/recovery][Timeout:2h] [Feature:EtcdRecovery][Disruptive] Recover with snapshot with two unhealthy nodes and lost quorum [Serial] and observe the event invariant failing on it crash looping
Actual results:
catalogd-controller-manager crash loops and causes our CI jobs to fail
Expected results:
our e2e job is green again and catalogd-controller-manager doesn't crash loop
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-operator-controller/pull/26
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver-operator/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/hypershift/pull/3017
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/493
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26486. The following is the description of the original issue:
—
Description of problem:
The following test started to fail frequently in the periodic tests: External Storage [Driver: pd.csi.storage.gke.io] [Testpattern: Dynamic PV (block volmode)] provisioning should provision storage with pvc data source in parallel
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Sometimes, but way too often in the CI
Steps to Reproduce:
1. Run the periodic-ci-openshift-release-master-nightly-X.X-e2e-gcp-ovn-csi test
Actual results:
Provisioning of some volumes fails with time="2024-01-05T02:30:07Z" level=info msg="resulting interval message" message="{ProvisioningFailed failed to provision volume with StorageClass \"e2e-provisioning-9385-e2e-scw2z8q\": rpc error: code = Internal desc = CreateVolume failed to create single zonal disk pvc-35b558d6-60f0-40b1-9cb7-c6bdfa9f28e7: failed to insert zonal disk: unknown Insert disk operation error: rpc error: code = Internal desc = operation operation-1704421794626-60e299f9dba08-89033abf-3046917a failed (RESOURCE_OPERATION_RATE_EXCEEDED): Operation rate exceeded for resource 'projects/XXXXXXXXXXXXXXXXXXXXXXXX/zones/us-central1-a/disks/pvc-501347a5-7d6f-4a32-b0e0-cf7a896f316d'. Too frequent operations from the source resource. map[reason:ProvisioningFailed]}"
Expected results:
Test passes
Additional info:
Looks like we're hitting the API quota limits with the test
Failed test run example:
Link to Sippy:
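The RESOURCE_OPERATION_RATE_EXCEEDED error above is a rate-limit response, so one generic way to tolerate it is to retry with exponential backoff. The following is only a minimal sketch of that idea and not the fix applied for this bug; RateLimitError and call_with_backoff are hypothetical names that do not come from the CSI driver or the test suite.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a RESOURCE_OPERATION_RATE_EXCEEDED failure."""


def call_with_backoff(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying rate-limit failures with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles on every attempt; jitter spreads out parallel callers.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

The sleep parameter is injectable so that the retry schedule can be exercised without waiting.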
Description of problem:
The IBM VPC CSI Driver fails to provision volumes in a proxy cluster. If I understand correctly, the proxy is not injected because our definition (https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml) injects the proxy into csi-driver:
config.openshift.io/inject-proxy: csi-driver
config.openshift.io/inject-proxy-cabundle: csi-driver
but the container name is iks-vpc-block-driver in https://github.com/openshift/ibm-vpc-block-csi-driver-operator/blob/master/assets/controller.yaml#L153. I checked, and the proxy is not defined in the controller pod or the driver container ENV.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Create an IBM cluster with a proxy setting
2. Create a pvc/pod with the IBM VPC CSI Driver
Actual results:
Volume provisioning fails
Expected results:
Provisioning volume works well on proxy cluster
Additional info:
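The mismatch described above (an inject-proxy annotation that names a container which does not exist) can be checked mechanically. The sketch below assumes the two annotations sit in the Deployment's metadata.annotations and that the manifest has already been parsed into a dict; unmatched_proxy_targets and the trimmed sample manifest are hypothetical and are not part of the operator.

```python
INJECT_ANNOTATIONS = (
    "config.openshift.io/inject-proxy",
    "config.openshift.io/inject-proxy-cabundle",
)


def unmatched_proxy_targets(deployment):
    """Return {annotation: value} for inject annotations that name a missing container."""
    annotations = deployment.get("metadata", {}).get("annotations", {})
    pod_spec = deployment["spec"]["template"]["spec"]
    container_names = {c["name"] for c in pod_spec.get("containers", [])}
    return {
        key: annotations[key]
        for key in INJECT_ANNOTATIONS
        if key in annotations and annotations[key] not in container_names
    }


# Trimmed, hypothetical manifest reproducing the names from the report.
controller = {
    "metadata": {
        "annotations": {
            "config.openshift.io/inject-proxy": "csi-driver",
            "config.openshift.io/inject-proxy-cabundle": "csi-driver",
        }
    },
    "spec": {"template": {"spec": {"containers": [{"name": "iks-vpc-block-driver"}]}}},
}

print(unmatched_proxy_targets(controller))
```

With the names from the report, both annotations are returned as unmatched; pointing them at iks-vpc-block-driver makes the result empty.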
DISCLAIMER: The code for measuring disruption in-cluster is extremely new, so we cannot be 100% confident that what we're seeing is real. However, the bug below demonstrates a problem that occurs in one very specific configuration while all others are unaffected, which gives us some confidence that what we're seeing is real.
The total disruption is the sum over a number of pods, so the actual duration of the disruption is roughly the total divided by 14. The actual disruption appears to be about 12 minutes and hits all pods doing pod-to-host monitoring simultaneously.
Sample job: (taken from expanding the "Most Recent Runs" panel in grafana)
In the first spyglass chart for upgrade you can see the batch of disruption: 7:28:19 - 7:40:03
We do not have data prior to ovn interconnect landing, so we cannot say if this started at that time or not.
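As a quick check of the arithmetic above, the window read off the spyglass chart (7:28:19 - 7:40:03) can be converted to seconds and scaled by the divisor of 14 given in the description. This is only a sketch: window_seconds is a hypothetical helper, and the summed figure is derived from the stated divisor rather than taken from the dashboard.

```python
from datetime import datetime


def window_seconds(start, end, fmt="%H:%M:%S"):
    """Length in seconds of a same-day window given as H:M:S strings."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


POD_COUNT = 14  # divisor quoted in the description above

actual = window_seconds("7:28:19", "7:40:03")  # single window seen on the chart
summed = actual * POD_COUNT                    # what a per-pod sum would report

print(f"actual: {actual}s (~{actual / 60:.1f} min), summed over pods: {summed}s")
```

The window comes to 704 seconds, which matches the "about 12 minutes" estimate.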
When the controller was moved, the existing one wasn't removed.
Description of problem:
IPI or UPI installation of a private cluster on GCP always fails, with the ingress cluster operator reporting LoadBalancerPending and CanaryChecksRepetitiveFailures
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-07-233748
How reproducible:
Always
Steps to Reproduce:
1. Create a private cluster on GCP, either IPI or UPI
Actual results:
The installation failed, with ingress operator degraded.
Expected results:
The installation can succeed.
Additional info:
Some PROW CI tests:
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920 (Must-gather https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-arm64-nightly-gcp-ipi-private-f28-longduration-cloud/1722352860160593920/artifacts/gcp-ipi-private-f28-longduration-cloud/gather-must-gather/artifacts/must-gather.tar)
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-xpn-private-f28/1722176483704705024
https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-gcp-ipi-private-fips-f6-disasterrecovery/1722066338567950336
FYI QE Flexy-install jobs: IPI Flexy-install/245364/, UPI Flexy-install/245524/
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version False True 14h Unable to apply 4.15.0-0.nightly-2023-11-07-233748: some cluster operators are not available
$ oc get nodes
NAME STATUS ROLES AGE VERSION
jiwei-1108-priv-kx7b4-master-0.c.openshift-qe.internal Ready control-plane,master 14h v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal Ready control-plane,master 14h v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-master-2.c.openshift-qe.internal Ready control-plane,master 14h v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-worker-a-l28pl.c.openshift-qe.internal Ready worker 14h v1.28.3+4cbdd29
jiwei-1108-priv-kx7b4-worker-b-84bx5.c.openshift-qe.internal Ready worker 14h v1.28.3+4cbdd29
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-11-07-233748 False False True 14h OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.jiwei-1108-priv.qe.gcp.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
cloud-controller-manager 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
cloud-credential 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
cluster-autoscaler 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
config-operator 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
console 4.15.0-0.nightly-2023-11-07-233748 False True False 14h DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
csi-snapshot-controller 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
dns 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
etcd 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
image-registry 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
ingress False True True 7h37m The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
kube-apiserver 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
kube-controller-manager 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
kube-scheduler 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
kube-storage-version-migrator 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
machine-api 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
machine-approver 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
machine-config 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
marketplace 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
monitoring 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
network 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
node-tuning 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
openshift-apiserver 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
openshift-controller-manager 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
openshift-samples 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
operator-lifecycle-manager 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
service-ca 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
storage 4.15.0-0.nightly-2023-11-07-233748 True False False 14h
$ oc describe co ingress
Name: ingress
Namespace:
Labels: <none>
Annotations: include.release.openshift.io/ibm-cloud-managed: true
             include.release.openshift.io/self-managed-high-availability: true
             include.release.openshift.io/single-node-developer: true
API Version: config.openshift.io/v1
Kind: ClusterOperator
Metadata:
  Creation Timestamp: 2023-11-08T10:38:15Z
  Generation: 1
  Owner References:
    API Version: config.openshift.io/v1
    Controller: true
    Kind: ClusterVersion
    Name: version
    UID: dbaae892-1b6d-480d-a201-0549d0a3149d
  Resource Version: 172514
  UID: 3922a9fe-584f-458f-ac4f-b62b4842758e
Spec:
Status:
  Conditions:
    Last Transition Time: 2023-11-08T17:49:01Z
    Message: The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
    Reason: IngressUnavailable
    Status: False
    Type: Available
    Last Transition Time: 2023-11-08T11:02:27Z
    Message: Not all ingress controllers are available.
    Reason: Reconciling
    Status: True
    Type: Progressing
    Last Transition Time: 2023-11-08T17:51:01Z
    Message: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
    Reason: IngressDegraded
    Status: True
    Type: Degraded
    Last Transition Time: 2023-11-08T10:52:36Z
    Reason: IngressControllersUpgradeable
    Status: True
    Type: Upgradeable
    Last Transition Time: 2023-11-08T10:52:36Z
    Reason: AsExpected
    Status: False
    Type: EvaluationConditionsDetected
  Extension: <nil>
  Related Objects:
    Group:
    Name: openshift-ingress-operator
    Resource: namespaces
    Group: operator.openshift.io
    Name:
    Namespace: openshift-ingress-operator
    Resource: ingresscontrollers
    Group: ingress.operator.openshift.io
    Name:
    Namespace: openshift-ingress-operator
    Resource: dnsrecords
    Group:
    Name: openshift-ingress
    Resource: namespaces
    Group:
    Name: openshift-ingress-canary
    Resource: namespaces
Events: <none>
$ oc get pods -n openshift-ingress-operator -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ingress-operator-57c555c75b-gqbk6 2/2 Running 2 (14h ago) 14h 10.129.0.36 jiwei-1108-priv-kx7b4-master-1.c.openshift-qe.internal <none> <none>
$ oc -n openshift-ingress-operator logs ingress-operator-57c555c75b-gqbk6
...output omitted...
2023-11-08T10:56:53.715Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1: Some pods are not scheduled: Pod \"router-default-7c86c4f4b5-jsljz\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Pod \"router-default-7c86c4f4b5-pltz4\" cannot be scheduled: 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.. Make sure you have sufficient worker nodes.), LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: INSTANCE_IN_MULTIPLE_LOAD_BALANCED_IGS - Validation failed for instance 'projects/openshift-qe/zones/us-central1-a/instances/jiwei-1108-priv-kx7b4-master-0': instance may belong to at most one load-balanced instance group.\nThe kube-controller-manager logs may contain more details.)"}
...output omitted...
2023-11-08T15:13:41.323Z ERROR operator.ingress_controller controller/controller.go:118 got retryable error; requeueing {"after": "1m0s", "error": "IngressController is degraded: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: googleapi: Error 400: Resource 'projects/openshift-qe/zones/us-central1-b/instances/jiwei-1108-priv-kx7b4-worker-b-84bx5' is expected to be in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-master-subnet' but is in the subnetwork 'projects/openshift-qe/regions/us-central1/subnetworks/jiwei-1108-priv-worker-subnet'., wrongSubnetwork\nThe kube-controller-manager logs may contain more details.), CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)"}
...output omitted...
Must-gather https://drive.google.com/file/d/1zwhJ4ga0-tQuRorha4XnUGUKbSTx1fx4/view?usp=drive_link
Description of the problem:
The reboot that happens after writing the RHCOS image to the disk fails with 4.15-ec.2 on KVM s390.
How reproducible:
I am not able to reproduce it in the QEMU s390x emulator, but Amadeus Podvratnik hit the issue on real hardware.
Steps to reproduce:
1. Use assisted installer with version 4.15-ec.2 to install to a logical partition.
Actual results:
The installer writes the RHCOS image to the disk, but then fails to boot from it. Instead it boots to the emergency shell and writes these errors to the console:
Nov 27 12:49:49 localhost ostree-prepare-root[1130]: ostree-prepare-root: Couldn't find specified OSTree root '/sysroot//ostree/boot.1/rhcos/452f29cc74e701f4f3f69e66657fe28788d6c490aa0032c138909b7b2ce429c7/0': No such file or directory
Nov 27 12:49:49 localhost systemd[1]: ostree-prepare-root.service: Main process exited, code=exited, status=1/FAILURE
Nov 27 12:49:49 localhost systemd[1]: ostree-prepare-root.service: Failed with result 'exit-code'.
Nov 27 12:49:49 localhost systemd[1]: Failed to start OSTree Prepare OS/.
Expected results:
Should boot and continue the installation.
If the user specifies baselineCapabilitySet: None in the install-config and does not specifically enable the baremetal capability, yet still uses platform: baremetal, then the install will reliably fail.
This failure takes the form of a timeout, with the bootkube logs (not easily accessible to the user) full of errors like:
bootkube.sh[46065]: "99_baremetal-provisioning-config.yaml": unable to get REST mapping for "99_baremetal-provisioning-config.yaml": no matches for kind "Provisioning" in version "metal3.io/v1alpha1"
bootkube.sh[46065]: "99_openshift-cluster-api_hosts-0.yaml": unable to get REST mapping for "99_openshift-cluster-api_hosts-0.yaml": no matches for kind "BareMetalHost" in version "metal3.io/v1alpha1"
Since the installer can tell when processing the install-config if the baremetal capability is missing, we should detect this and error out immediately to save the user an hour of their life and us a support case.
Although this was found on an agent install, I believe the same will apply to a baremetal IPI install.
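The early check proposed above can be sketched as follows. This is an illustration in Python rather than the installer's own code; validate_baremetal_capability is a hypothetical name, the install-config is assumed to be already parsed into a dict, and only the baselineCapabilitySet: None case from this report is handled.

```python
def validate_baremetal_capability(install_config):
    """Return an error string when platform baremetal lacks the baremetal capability."""
    platform = install_config.get("platform") or {}
    if "baremetal" not in platform:
        return None  # other platforms are out of scope for this check

    capabilities = install_config.get("capabilities") or {}
    baseline = capabilities.get("baselineCapabilitySet")
    enabled = capabilities.get("additionalEnabledCapabilities") or []

    # Only the case from the report: baseline None without baremetal re-enabled.
    if baseline == "None" and "baremetal" not in enabled:
        return (
            "platform baremetal requires the baremetal capability: "
            "add it to additionalEnabledCapabilities or use a different baselineCapabilitySet"
        )
    return None


broken = {
    "platform": {"baremetal": {}},
    "capabilities": {"baselineCapabilitySet": "None"},
}
print(validate_baremetal_capability(broken))
```

Returning a message instead of raising keeps the sketch easy to compose with other validation steps; failing fast at this point would replace the bootstrap timeout described above.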
Description of problem:
It's blocking the Prow CI test: https://github.com/openshift/release/pull/42822#issuecomment-1760704535
[cloud-user@preserve-olm-env2 jian]$ oc image extract registry.ci.openshift.org/ocp/4.15:cli --path /usr/bin/oc:. --confirm
[cloud-user@preserve-olm-env2 jian]$ sudo chmod 777 oc
[cloud-user@preserve-olm-env2 jian]$
[cloud-user@preserve-olm-env2 jian]$ ./oc version
Client Version: v4.2.0-alpha.0-2030-g0307852
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[cloud-user@preserve-olm-env2 jian]$ ./oc image mirror --insecure=true --skip-missing=true --skip-verification=true --keep-manifest-list=true --filter-by-os='.*' quay.io/openshifttest/ociimage:multiarch localhost:5000/olmqe/ociimage3:multiarch
localhost:5000/
  olmqe/ociimage3
    error: the manifest type *ocischema.DeserializedImageIndex is not supported
    manifests:
      sha256:d58e3e003ddec723dd14f72164beaa609d24c5e5e366579e23bc8b34b9a58324 -> multiarch
  stats: shared=0 unique=0 size=0B
error: the manifest type *ocischema.DeserializedImageIndex is not supported
error: an error occurred during planning
Version-Release number of selected component (if applicable):
The master branch of https://github.com/openshift/oc : https://github.com/openshift/oc/commit/03078525c97d612c2070081d0e9f322f946360f4
[cloud-user@preserve-olm-env2 jian]$ podman inspect registry.ci.openshift.org/ocp/4.15:cli
[
  {
    "Id": "feac27a180964dff0a0ff0a9fcdb593fcf87a7d80177e6c79ab804fb8477f55b",
    "Digest": "sha256:8fcc83d3c72c66867c38456a217298239d99626d96012dbece5c669e3ad5952c",
    "RepoTags": [
      "registry.ci.openshift.org/ocp/4.15:cli"
    ],
    "RepoDigests": [
      "registry.ci.openshift.org/ocp/4.15@sha256:8fcc83d3c72c66867c38456a217298239d99626d96012dbece5c669e3ad5952c",
      "registry.ci.openshift.org/ocp/4.15@sha256:cf4f54e2f20af19afe3c5c0685aa95ab3296d177204b01a3d8bfddf7c3d45f49"
    ],
    ...
    "summary": "Provides the latest release of the Red Hat Extended Life Base Image.",
    "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/openshift/ose-base/images/v4.15.0-202310111407.p0.g16dbf5e.assembly.stream",
    "vcs-ref": "03078525c97d612c2070081d0e9f322f946360f4",
    "vcs-type": "git",
    "vcs-url": "https://github.com/openshift/oc",
    "vendor": "Red Hat, Inc.",
    "version": "v4.15.0"
    ...
    "created": "2023-10-12T23:06:08.279786979Z",
    "created_by": "/bin/sh -c #(nop) LABEL \"io.openshift.build.name\"=\"cli-amd64\" \"io.openshift.build.namespace\"=\"ci-op-37527gwf\" \"io.openshift.build.commit.author\"=\"\" \"io.openshift.build.commit.date\"=\"\" \"io.openshift.build.commit.id\"=\"03078525c97d612c2070081d0e9f322f946360f4\" \"io.openshift.build.commit.message\"=\"\" \"io.openshift.build.commit.ref\"=\"master\" \"io.openshift.build.name\"=\"\" \"io.openshift.build.namespace\"=\"\" \"io.openshift.build.source-context-dir\"=\"\" \"io.openshift.build.source-location\"=\"https://github.com/openshift/oc\" \"io.openshift.ci.from.base\"=\"sha256:d7a2588527405101eeb1578a0e97e465ec83b0b927b71cf689703554e81cb585\" \"vcs-ref\"=\"03078525c97d612c2070081d0e9f322f946360f4\" \"vcs-type\"=\"git\" \"vcs-url\"=\"https://github.com/openshift/oc\"",
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
4.15.0-0.nightly-2023-10-09-101435 (`oc` commit 1bbfec243e5910a5a86df985489700c3d3137aed) works well.
[cloud-user@preserve-olm-env2 client]$ ./oc version
Client Version: 4.15.0-0.nightly-2023-10-09-101435
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[cloud-user@preserve-olm-env2 client]$ ./oc image mirror --insecure=true --skip-missing=true --skip-verification=true --keep-manifest-list=true --filter-by-os='.*' quay.io/openshifttest/ociimage:multiarch localhost:5000/olmqe/ociimage2:multiarch2
localhost:5000/
  olmqe/ociimage2
  ...
sha256:d58e3e003ddec723dd14f72164beaa609d24c5e5e366579e23bc8b34b9a58324 localhost:5000/olmqe/ociimage2:multiarch2
info: Mirroring completed in 2.47s (72.87MB/s)
[cloud-user@preserve-olm-env2 oc]$ oc adm release info registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-10-09-101435 --commits |grep oc
Pull From: registry.ci.openshift.org/ocp/release@sha256:b5d1f88597d49d0e34ed4acfe3149817d02774d4c0661cbcb0c04896d1a852c6
...
  tools https://github.com/openshift/oc 1bbfec243e5910a5a86df985489700c3d3137aed
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1179
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
CCO supports creating a credentials request in manual mode to specify the fields required for short-term authentication using workload identity federation (WIF), but the console fields and warnings that are supposed to be present are missing.
Version-Release number of selected component (if applicable):
How reproducible:
Create a catalog containing a bundle that has the annotation to support WIF and apply it to an OIDC manual-mode Azure cluster.
Steps to Reproduce:
1. 2. 3.
Actual results:
No warnings or additional field options for subscription are present
Expected results:
Warnings and additional fields for subscription should be present
Additional info:
This is a clone of issue OCPBUGS-23550. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The apbexternalroute and egressfirewall status is shown as empty on a HyperShift hosted cluster
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-17-173511
How reproducible:
Always
Steps to Reproduce:
1. Set up HyperShift and log in to the hosted cluster
% oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-128-55.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74
ip-10-0-129-197.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74
ip-10-0-135-106.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74
ip-10-0-140-89.us-east-2.compute.internal Ready worker 125m v1.28.4+7aa0a74
2. Create a new project named test
% oc new-project test
3. Create an apbexternalroute and an egressfirewall on the hosted cluster
apbexternalroute yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: apbex-route-policy
spec:
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: test
  nextHops:
    static:
    - ip: "172.18.0.8"
    - ip: "172.18.0.9"
% oc apply -f apbexroute.yaml
adminpolicybasedexternalroute.k8s.ovn.org/apbex-route-policy created
egressfirewall yaml file:
---
apiVersion: k8s.ovn.org/v1
kind: EgressFirewall
metadata:
  name: default
spec:
  egress:
  - type: Allow
    to:
      cidrSelector: 0.0.0.0/0
% oc apply -f egressfw.yaml
egressfirewall.k8s.ovn.org/default created
4. Run oc get apbexternalroute and oc get egressfirewall
Actual results:
The status shows as empty:
% oc get apbexternalroute
NAME LAST UPDATE STATUS
apbex-route-policy 49s <--- status is empty
% oc describe apbexternalroute apbex-route-policy | tail -n 8
Status:
  Last Transition Time: 2023-12-19T06:54:17Z
  Messages:
    ip-10-0-135-106.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-129-197.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-128-55.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
    ip-10-0-140-89.us-east-2.compute.internal: configured external gateway IPs: 172.18.0.8,172.18.0.9
Events: <none>
% oc get egressfirewall
NAME EGRESSFIREWALL STATUS
default <--- status is empty
% oc describe egressfirewall default | tail -n 8
    Type: Allow
Status:
  Messages:
    ip-10-0-129-197.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-128-55.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-140-89.us-east-2.compute.internal: EgressFirewall Rules applied
    ip-10-0-135-106.us-east-2.compute.internal: EgressFirewall Rules applied
Events: <none>
Expected results:
The status is shown correctly
Additional info:
Description of problem:
During testing of the NE1264 epic, I configured both the syslog and the container destination type for logging on the same default ingress controller. The ingress controller spec accepts both destination types, but this is not reflected in the ROUTER_LOG_MAX_LENGTH env variable or in the haproxy.config file.
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
<-----snip--->
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  logging:
    access:
      destination:
        container:
          maxLength: 1024
        syslog:
          address: 1.2.3.4
          maxLength: 1024
          port: 514
        type: Container
      logEmptyRequests: Log
  replicas: 2
  tuningOptions:
    reloadInterval: 0s
  unsupportedConfigOverrides: null
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6c86ff75d9-g24q5 -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=1024
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6c86ff75d9-l9rjv -- cat haproxy.config | grep 1024
Defaulted container "router" out of: router, logs
log /var/lib/rsyslog/rsyslog.sock len 1024 local1 info
When we patch the log length, the change is not reflected as expected for one of the destinations.
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator patch ingresscontroller/default -p '{"spec":{"logging":{"access":{"destination":{"container":{"maxLength":480}}}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-6476d6c69d-tlhqd -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=480
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator patch ingresscontroller/default -p '{"spec":{"logging":{"access":{"destination":{"syslog":{"maxLength":4096}}}}}}' --type=merge
ingresscontroller.operator.openshift.io/default patched
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress-operator get ingresscontroller/default -oyaml
apiVersion: operator.openshift.io/v1
kind: IngressController
<----snip---->
spec:
  clientTLS:
    clientCA:
      name: ""
    clientCertificatePolicy: ""
  httpCompression: {}
  httpEmptyRequestsPolicy: Respond
  httpErrorCodePages:
    name: ""
  logging:
    access:
      destination:
        container:
          maxLength: 480
        syslog:
          address: 1.2.3.4
          maxLength: 4096
          port: 514
        type: Container
      logEmptyRequests: Log
  replicas: 2
  tuningOptions:
    reloadInterval: 0s
  unsupportedConfigOverrides: null
melvinjoseph@mjoseph-mac Downloads % oc -n openshift-ingress exec router-default-59cf55666d-shq98 -- env | grep ROUTER_LOG_MAX_LENGTH
Defaulted container "router" out of: router, logs
ROUTER_LOG_MAX_LENGTH=480
In another round of testing, I saw that only the syslog destination type was reflected in the env variable and not the container destination type. I am also not sure whether using both destination types on the default ingress controller is a valid configuration.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Steps to Reproduce:
1. Edit the default ingress controller and add both destination type configs
Actual results:
Only one destination type's value is reflected in the haproxy.config file
Expected results:
Both types should be reflected
Additional info:
Description of problem:
CPMS is supported in 4.15 on the vSphere platform when TechPreviewNoUpgrade is enabled. However, after I built a cluster with no failure domains (or with a single failure domain) set in the install-config, the CPMS contained three duplicated failure domains.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Install a cluster with TP enabled and don't set a failure domain (or set a single failure domain) in the install-config.
Steps to Reproduce:
1. Do not configure a failure domain in the install-config (or set a single failure domain).
2. Install a cluster with TP enabled.
3. Check the CPMS with the command: oc get controlplanemachineset -oyaml
Actual results:
Duplicated failure domains:
failureDomains:
  platform: VSphere
  vsphere:
  - name: generated-failure-domain
  - name: generated-failure-domain
  - name: generated-failure-domain
metadata:
  labels:
Expected results:
The failure domain should not be duplicated when a single failure domain is set in the install-config. No failure domain should exist when none is set in the install-config.
Additional info:
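For the single-failure-domain case, the expected result amounts to collapsing entries that share a name. The following is only a sketch of that expectation, not the change made in the operator or installer; dedupe_failure_domains is a hypothetical helper that works on the vsphere list once it has been parsed from the CPMS YAML.

```python
def dedupe_failure_domains(domains):
    """Drop failure domains whose name was already seen, keeping the first-seen order."""
    seen = set()
    unique = []
    for domain in domains:
        name = domain["name"]
        if name not in seen:
            seen.add(name)
            unique.append(domain)
    return unique


# The list reported under "Actual results".
reported = [
    {"name": "generated-failure-domain"},
    {"name": "generated-failure-domain"},
    {"name": "generated-failure-domain"},
]

print(dedupe_failure_domains(reported))
```

The case where no failure domain is configured at all is different: there the expectation is that the list is absent, which this helper does not address.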
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/19
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/86
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/564
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Bump KSM to the latest v2.10.1 release that addresses a regression in the previous upstream release as well as builds with a newer Golang patch version (v1.20.8).
Description of problem:
Installing a private cluster with Azure workload identity failed because no worker machines were provisioned.
install-config:
----------------------
platform:
  azure:
    region: eastus
    networkResourceGroupName: jima971b-12015319-rg
    virtualNetwork: jima971b-vnet
    controlPlaneSubnet: jima971b-master-subnet
    computeSubnet: jima971b-worker-subnet
    resourceGroupName: jima971b-rg
publish: Internal
credentialsMode: Manual
A detailed check of the cluster found that the machine-api, ingress, and image-registry operators reported permission issues and had no access to the customer vnet.
$ oc get machine -n openshift-machine-api
NAME                                  PHASE     TYPE              REGION   ZONE   AGE
jima971b-qqjb7-master-0               Running   Standard_D8s_v3   eastus   2      5h14m
jima971b-qqjb7-master-1               Running   Standard_D8s_v3   eastus   3      5h14m
jima971b-qqjb7-master-2               Running   Standard_D8s_v3   eastus   1      5h15m
jima971b-qqjb7-worker-eastus1-mtc47   Failed                                      4h52m
jima971b-qqjb7-worker-eastus2-ph8bk   Failed                                      4h52m
jima971b-qqjb7-worker-eastus3-hpmvj   Failed                                      4h52m
Errors on worker machine:
--------------------
errorMessage: 'failed to reconcile machine "jima971b-qqjb7-worker-eastus1-mtc47": network.SubnetsClient#Get: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client ''705eb743-7c91-4a16-a7cf-97164edc0341'' with object id ''705eb743-7c91-4a16-a7cf-97164edc0341'' does not have authorization to perform action ''Microsoft.Network/virtualNetworks/subnets/read'' over scope ''/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima971b-12015319-rg/providers/Microsoft.Network/virtualNetworks/jima971b-vnet/subnets/jima971b-worker-subnet'' or the scope is invalid. If access was recently granted, please refresh your credentials."'
errorReason: InvalidConfiguration
After manually creating a custom role with the missing permissions for machine-api/ingress/cloud-controller-manager/image-registry, and assigning it to the machine-api/ingress/cloud-controller-manager/image-registry user-assigned identities on the scope of the customer vnet, the cluster recovered and became running.
Permissions for machine-api/cloud-controller-manager/ingress on customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
Permissions for image-registry on customer vnet:
"Microsoft.Network/virtualNetworks/subnets/read",
"Microsoft.Network/virtualNetworks/subnets/join/action"
"Microsoft.Network/virtualNetworks/join/action"
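The permission lists above can be turned into a simple pre-install check. The following is a minimal sketch, not part of ccoctl: the function name `missing_vnet_actions` and the table layout are assumptions made for illustration, and the comparison is an exact string match (it does not expand Azure wildcard actions such as `Microsoft.Network/*`).

```python
# Actions the report lists as required on the customer vnet, per component.
# The dictionary layout and names are illustrative, not taken from ccoctl.
_SUBNET_ACTIONS = {
    "Microsoft.Network/virtualNetworks/subnets/read",
    "Microsoft.Network/virtualNetworks/subnets/join/action",
}

REQUIRED_VNET_ACTIONS = {
    "machine-api": set(_SUBNET_ACTIONS),
    "cloud-controller-manager": set(_SUBNET_ACTIONS),
    "ingress": set(_SUBNET_ACTIONS),
    "image-registry": _SUBNET_ACTIONS
    | {"Microsoft.Network/virtualNetworks/join/action"},
}


def missing_vnet_actions(component, granted_actions):
    """Return the required actions absent from granted_actions, sorted.

    Exact string comparison only: wildcard role actions are not expanded.
    """
    required = REQUIRED_VNET_ACTIONS[component]
    return sorted(required - set(granted_actions))


if __name__ == "__main__":
    # An identity with no vnet role at all, as in the failing worker machines.
    for action in missing_vnet_actions("machine-api", []):
        print("missing:", action)
```

An empty result for every component would indicate that the role assigned on the customer vnet scope covers the actions named in this report.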
Version-Release number of selected component (if applicable):
4.15 nightly build
How reproducible:
always on recent 4.15 payload
Steps to Reproduce:
1. Prepare install-config with private cluster configuration + credentialsMode: Manual
2. Use the ccoctl tool to create the workload identity
3. Install the cluster
Actual results:
Installation failed due to permission issues
Expected results:
ccoctl should also assign a custom role to the machine-api/ccm/image-registry user-assigned identities on the scope of the customer vnet, if a customer vnet is configured in install-config
Additional info:
The issue is only seen on 4.15; it works on 4.14.
Description of problem:
Bootstrap process fails. When attempting to gather logs, the process fails. The SSH connection was refused.
Version-Release number of selected component (if applicable):
How reproducible:
Always, when the bootstrap process fails
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
A cluster with 3 masters and 100 workers installed successfully.
The attempt to download the installation logs failed: nothing happened.
Error raised in the debugger console:
Access to XMLHttpRequest at 'https://api.stage.openshift.com/api/assisted-install/v2/clusters/c7d60db0-2997-4380-813d-b504134e9920/downloads/files-presigned?file_name=logs&logs_type=all' from origin 'https://qaprodauth.console.redhat.com' has been blocked by CORS policy: No 'Access-Control-Allow-Origin' header is present on the requested resource.
src_bootstrap_tsx-src_moduleOverrides_unfetch_ts-webpack_sharing_consume_default_patternfly_r-31174d.fcbb79a89748b2f6.js:22320
GET https://api.stage.openshift.com/api/assisted-install/v2/clusters/c7d60db0-2997-4380-813d-b504134e9920/downloads/files-presigned?file_name=logs&logs_type=all
net::ERR_FAILED 504 (Gateway Timeout)
It happened on browsers:
Chrome 117.0.5938.92
Firefox 117.0.1 (64-bit)
See attached screenshots and logs from Assisted Service pod
I can successfully download installation logs from other clusters using the same browsers.
Steps to reproduce:
1. Install cluster with 103 nodes
2. Try to download the installation logs
Actual results:
Nothing happened and an error was raised
Expected results:
The installation logs should be downloaded
Description of problem:
Critical Alert Rules do not have runbook url
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
This bug is being raised by the OpenShift Monitoring team as part of an effort to detect invalid Alert Rules in OCP.
1. Check details of the MultipleDefaultStorageClasses Alert Rule
Actual results:
The Alert Rule MultipleDefaultStorageClasses has Critical Severity, but does not have runbook_url annotation.
Expected results:
All Critical Alert Rules must have a runbook_url annotation
Additional info:
Critical Alerts must have a runbook; please refer to the style guide at https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide
The runbooks are located at github.com/openshift/runbooks
To resolve the bug,
- Add runbooks for the relevant Alerts at github.com/openshift/runbooks
- Add the link to the runbook in the Alert annotation 'runbook_url'
- Remove the exception in the origin test, added in PR https://github.com/openshift/origin/pull/27933
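The rule described above (every critical alert needs a `runbook_url` annotation) can be checked mechanically. This is a minimal sketch, assuming the rules have already been loaded into the usual Prometheus rule-group shape (`groups` containing `rules` with `alert`, `labels`, and `annotations`); the function name and the sample data are hypothetical and this is not the origin test itself.

```python
def critical_rules_missing_runbook(rule_groups):
    """Return the names of critical alerting rules without a runbook_url.

    rule_groups is a list of dicts shaped like Prometheus rule groups:
    {"name": ..., "rules": [{"alert": ..., "labels": {...}, "annotations": {...}}]}
    """
    missing = []
    for group in rule_groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no severity or runbook
            severity = rule.get("labels", {}).get("severity")
            runbook = rule.get("annotations", {}).get("runbook_url")
            if severity == "critical" and not runbook:
                missing.append(rule["alert"])
    return missing


if __name__ == "__main__":
    # Sample data for illustration only; only the alert name and severity
    # come from the bug report.
    groups = [
        {
            "name": "storage",
            "rules": [
                {
                    "alert": "MultipleDefaultStorageClasses",
                    "labels": {"severity": "critical"},
                    "annotations": {"summary": "More than one default StorageClass"},
                }
            ],
        }
    ]
    print(critical_rules_missing_runbook(groups))
```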
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The BMH shows as powered off even when the node is up. This causes the customer's software to behave incorrectly because of the wrong status on the BMH.
$ oc get bmh -n openshift-machine-api control-1-ru2 -o json | jq '.status|.operationalStatus,.poweredOn,.provisioning.state'
"OK"
false
"externally provisioned"
The following error can be seen:
2023-10-10T06:05:02.554453960Z {"level":"info","ts":1696917902.5544183,"logger":"provisioner.ironic","msg":"could not update node settings in ironic, busy","host":"openshift-machine-api~control-1-ru4"}
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Launch the cluster with OCP v4.12.32 on Lenovo servers
Actual results:
The BMH reports a false node status
Expected results:
The BMH should report the correct node status
Additional info:
Description of problem:
Unable to edit Shipwright Builds with the upcoming builds for Red Hat OpenShift release (based on Shipwright v0.12.0) in the developer and admin consoles. Workaround is to use `oc edit build.shipwright.io ...`
Version-Release number of selected component (if applicable):
OCP 4.14 builds for OpenShift v1.0.0
How reproducible:
Always
Steps to Reproduce:
1. Deploy the builds for Red Hat OpenShift release candidate operator
2. Create a Build using the shp command line: `shp build create ...`
3. Open the Dev or Admin console for Shipwright Builds
4. Attempt to edit the Build object
Actual results:
Page appears to "freeze", does not let you edit.
Expected results:
Shipwright Build objects can be edited.
Additional info:
Can be reproduced by deploying the following "test catalog" - quay.io/adambkaplan/shipwright-io/operator-catalog:v0.13.0-rc7, then creating a subscription for the Shipwright operator. Will likely be easier to reproduce once we have the downstream operator in the Red Hat OperatorHub catalog.
Description of problem: MCN lister fires in the operator pod before the CRD exists. This causes API issues and could impact upgrades.
Version-Release number of selected component (if applicable):
How reproducible: always
Steps to Reproduce:
1. Upgrade to 4.15 from any version
Actual results:
I1211 18:44:40.972098 1 operator.go:347] Starting MachineConfigOperator I1211 18:44:40.982079 1 event.go:298] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"68bc5e8f-b7f5-4506-a870-2eecaa5afd35", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator started a version change from [{operator 4.14.6}] to [{operator 4.15.0-0.nightly-2023-12-11-033133}] W1211 18:44:41.255502 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:44:41.255587 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:04.915119 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:06.425952 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:06.426037 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get 
machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:09.396004 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:09.396068 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:14.540488 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:14.540560 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:25.293029 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:25.293095 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:58:50.166866 1 reflector.go:535] 
github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:58:50.166903 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 18:59:39.950454 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 18:59:39.950523 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:00:23.432005 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:00:23.432038 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:01:13.237298 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list 
*v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:01:13.237382 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:02:02.035555 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:02:02.035628 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:02:52.111260 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:02:52.111332 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:03:38.243461 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get 
machineconfignodes.machineconfiguration.openshift.io) E1211 19:03:38.243499 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) W1211 19:04:27.848493 1 reflector.go:535] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:04:27.848585 1 reflector.go:147] github.com/openshift/client-go/machineconfiguration/informers/externalversions/factory.go:116: Failed to watch *v1alpha1.MachineConfigNode: failed to list *v1alpha1.MachineConfigNode: the server could not find the requested resource (get machineconfignodes.machineconfiguration.openshift.io) E1211 19:05:37.064033 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:38.057685 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:39.036638 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:40.039736 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. 
Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:41.039696 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:42.034840 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:43.044901 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:44.033229 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:45.034792 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)" E1211 19:05:46.052866 1 sync.go:1250] Error syncing Required MachineConfigPools: "error MachineConfigPool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)"
Expected results:
Additional info:
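One general way to avoid listing a resource before its CRD exists is to gate the lister start on a readiness check. The sketch below illustrates that idea only; it is not the fix adopted in the machine-config-operator. `resource_exists` and `start_lister` are hypothetical callables supplied by the caller, and the timeout values are arbitrary.

```python
import time


def wait_for_resource(resource_exists, timeout_s=60.0, interval_s=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll resource_exists() until it returns True or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        if resource_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def start_lister_when_ready(resource_exists, start_lister, **wait_kwargs):
    """Start the lister only after the resource (for example a CRD) is served.

    Returns True if the lister was started, False if the wait timed out.
    """
    if wait_for_resource(resource_exists, **wait_kwargs):
        start_lister()
        return True
    return False


if __name__ == "__main__":
    # Simulated discovery: the CRD becomes available on the third poll.
    answers = iter([False, False, True])
    started = start_lister_when_ready(
        lambda: next(answers),
        lambda: print("MachineConfigNode lister started"),
        timeout_s=5,
        interval_s=0.01,
    )
    print("started:", started)
```

With a gate like this, the repeated "failed to list *v1alpha1.MachineConfigNode" reflector warnings would not be emitted while the CRD is still missing, at the cost of delaying the lister start.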
Description of problem:
In the 4.14 z-stream rollback job, I'm seeing the test case "[sig-network] pods should successfully create sandboxes by adding pod to network" fail.
The job link is here: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-aws-ovn-upgrade-rollback-oldest-supported/1719037590788640768
The error is:
56 failures to create the sandbox
ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3314.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(5b36bc12b2964e85bcdbe60b275d6a12ea68cb18b81f16622a6cb686270c4eb3): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-48-75.us-east-2.compute.internal - 3321.57 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_95d1a457-3e1b-4ae3-8b57-8023eec5937d_0(3cc0afc5bec362566e4c3bdaf822209377102c2e39aaa8ef5d99b0f4ba795aaf): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": dial unix /run/multus/socket/multus.sock: connect: connection refused
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-30-170011
How reproducible:
Flaky
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The rollback test installs 4.14.0, starts an upgrade to the latest 4.14 nightly, and then, at some random point, rolls back to 4.14.0.
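The two sandbox errors in this report end differently ("EOF" versus "connection refused" on the multus socket), so when triaging many such failures it can help to bucket them by that tail. The following is a minimal sketch; the bucket labels `shim-eof` and `socket-refused` are names invented here, not terms used by multus or the CI tooling.

```python
import re

# Matches the tail of the multus-shim error shown in the report.
_CNI_TAIL = re.compile(r'Post "http://dummy/cni": (?P<cause>.+)$')


def classify_sandbox_error(message):
    """Bucket a FailedCreatePodSandBox message by its multus-shim failure tail."""
    match = _CNI_TAIL.search(message.strip())
    if not match:
        return "other"
    cause = match.group("cause")
    if cause == "EOF":
        return "shim-eof"  # request was sent but the connection closed
    if "connection refused" in cause:
        return "socket-refused"  # nothing was listening on the multus socket
    return "other"


if __name__ == "__main__":
    sample = (
        'CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": '
        "dial unix /run/multus/socket/multus.sock: connect: connection refused"
    )
    print(classify_sandbox_error(sample))
```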
This fix contains the following changes coming from updated version of kubernetes up to v1.28.5:
Changelog:
v1.28.4: https://github.com/kubernetes/kubernetes/blob/release-1.28/CHANGELOG/CHANGELOG-1.28.md#changelog-since-v1284
Description of problem:
The installer does not precheck whether the node architecture and VM type are consistent on AWS and GCP; the precheck works on Azure.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. Set the compute architecture field to arm64 but choose an amd64 instance type as the VM type in install-config
2. Create the cluster
3. Check the installation
Actual results:
Azure prechecks whether the architecture is consistent with the instance type when creating manifests, for example:
12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64
AWS and GCP have no such precheck, so the installation fails partway through, after many resources have already been created. This case is more likely to happen in multi-arch clusters.
Expected results:
The installer should precheck that the architecture and VM type are consistent, especially on platforms that support heterogeneous clusters (AWS, GCP, Azure)
Additional info:
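The requested precheck amounts to comparing the architecture of the chosen instance type with the architecture declared in install-config before any resources are created. The sketch below shows the shape of such a check. It is not the installer's implementation: the lookup table is a hand-written illustration (a real check would query the cloud provider), and the function name and field path argument are assumptions.

```python
# Illustrative lookup only. A real precheck would ask the cloud API for the
# instance type's architecture instead of using a static table.
INSTANCE_TYPE_ARCH = {
    "Standard_D4ps_v5": "arm64",  # Azure type from the error quoted above
    "m6g.xlarge": "arm64",        # AWS Graviton instance type
    "m6i.xlarge": "amd64",        # AWS x86_64 instance type
}


def check_instance_architecture(field, instance_type, config_arch):
    """Return an error string on mismatch, or None if consistent or unknown."""
    actual = INSTANCE_TYPE_ARCH.get(instance_type)
    if actual is None:
        return None  # unknown type: this sketch cannot judge it
    if actual != config_arch:
        return (
            f'{field}: Invalid value: "{instance_type}": instance type '
            f"architecture {actual!r} does not match install config "
            f"architecture {config_arch}"
        )
    return None


if __name__ == "__main__":
    # Mirrors the reproduction steps: arm64 compute with an amd64 instance type.
    print(check_instance_architecture(
        "compute[0].platform.aws.type", "m6i.xlarge", "arm64"))
```

Running a check of this kind while generating manifests would surface the mismatch at the same stage where Azure already reports it.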
Description of problem:
New spot VMs fail to be created by machinesets that define providerSpec.value.spotVMOptions in Azure regions without Availability Zones. The azure-controller logs the error: "Azure Spot Virtual Machine is not supported in Availability Set." A new availabilitySet is created for each machineset in non-zonal regions, but this only works with normal nodes. Spot VMs and availabilitySets are incompatible, as the Microsoft docs state for this error: "You need to choose to either use an Azure Spot Virtual Machine or use a VM in an availability set, you can't choose both." From: https://learn.microsoft.com/en-us/azure/virtual-machines/error-codes-spot
Version-Release number of selected component (if applicable):
n/a
How reproducible:
Always
Steps to Reproduce:
1. Follow the instructions to create a machineset to provision spot VMs: https://docs.openshift.com/container-platform/4.12/machine_management/creating_machinesets/creating-machineset-azure.html#machineset-creating-non-guaranteed-instance_creating-machineset-azure
2. New machines will be in Failed state:
$ oc get machines -A
NAMESPACE               NAME                                            PHASE    TYPE   REGION   ZONE   AGE
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-c4qr5   Failed                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-dtzsn   Failed                          7m17s
openshift-machine-api   mabad-test-l5x58-worker-southindia-spot-tzrhw   Failed                          7m28s
3. Events in the failed machines show errors creating spot VMs with availabilitySets:
Events:
Type     Reason        Age   From              Message
----     ------        ----  ----              -------
Warning  FailedCreate  28s   azure-controller  InvalidConfiguration: failed to reconcile machine "mabad-test-l5x58-worker-southindia-spot-dx78z": failed to create vm mabad-test-l5x58-worker-southindia-spot-dx78z: failure sending request for machine mabad-test-l5x58-worker-southindia-spot-dx78z: cannot create vm: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Azure Spot Virtual Machine is not supported in Availability Set. For more information, see http://aka.ms/AzureSpot/errormessages."
Actual results:
Machines stay in Failed state and nodes are not created
Expected results:
Machines get created and new spot VM nodes added to the cluster.
Additional info:
This problem was identified from a customer alert in an ARO cluster. ICM for ref (requires b- MSFT account): https://portal.microsofticm.com/imp/v3/incidents/incident/455463992/summary
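The expected behaviour can be summarised as a small decision rule: an availability set is only appropriate for a non-spot VM in a region without Availability Zones. This is a sketch of that rule as described in the report, not the Azure machine provider's code; the function and parameter names are hypothetical.

```python
def should_use_availability_set(region_has_zones, is_spot_vm):
    """Decide whether a new VM should be placed in an availability set.

    Follows the constraints stated in the report: availability sets are used
    in non-zonal regions, and Azure rejects spot VMs in availability sets.
    """
    if region_has_zones:
        return False  # zonal regions place machines in zones instead
    if is_spot_vm:
        return False  # spot VMs and availability sets cannot be combined
    return True


if __name__ == "__main__":
    # The failing case from the report: spot VM in a non-zonal region.
    print(should_use_availability_set(region_has_zones=False, is_spot_vm=True))
```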
The test:
[sig-network] pods should successfully create sandboxes by adding pod to network
Failed a couple payloads today with 1-2 failures in batches of 10 aggregated jobs. I looked at the most recent errors and they seem to often be the same:
1 failures to create the sandbox ns/openshift-monitoring pod/prometheus-k8s-1 node/ip-10-0-24-217.us-west-1.compute.internal - 475.52 seconds after deletion - reason/FailedCreatePodSandBox Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_prometheus-k8s-1_openshift-monitoring_c712fc61-5a1e-4cec-b6fa-18c8f2e91c0a_0(46df8384ffeb433fc0e4864262aa52f2ede570265c43bf8b0900f184b27b10f1): error adding pod openshift-monitoring_prometheus-k8s-1 to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): failed to send CNI request: Post "http://dummy/cni": EOF
This http://dummy/cni URL looked interesting and seemed worthy of a bug.
The problem is a rare failure overall, but happening quite frequently day to day, search.ci indicates lots of hits over the last two days in both 4.14 and 4.15, and seemingly ovn and sdn both:
Some of these will show as flakes as the test gets retried at times and then passes.
Additionally in 4.14 we are seeing similar failures reporting
No CNI configuration file in /etc/kubernetes/cni/net.d/. Has your network provider started?
4.14.0-0.nightly-2023-10-12-015817 show pod sandbox errors for azure & aws both show a drop from the 10th which comes after our force accept
4.14.0-0.nightly-2023-10-11-141212 had a host of failures but it is what killed aws sdn
4.14.0-0.nightly-2023-10-11-200059 aws sdn as well and shows in azure
Description of problem:
New machines got stuck in Provisioned state when the customer tried to scale the machineset.
~~~
NAME PHASE TYPE REGION ZONE AGE
ocp4-ftf8t-worker-2-wn6lp Provisioned 44m
ocp4-ftf8t-worker-redhat-x78s5 Provisioned 44m
~~~
Upon checking the journalctl logs from these VMs, we noticed that they were failing with "no space left on device" errors while pulling images.
To troubleshoot further, we had to break the root password in order to log in.
Once the root password was broken, we logged in to the system and checked the journalctl logs for failures.
We could see "no space left on device" for image pulls. Checking the df -h output, we could see that /dev/sda4 (/dev/mapper/coreos-luks-root-nocrypt), which is mounted on /sysroot, was 100% full.
Because the image pulls failed, machine-config-daemon-firstboot.service could not complete. This prevented the node from getting to 4.12 and from joining the cluster.
The rest of the errors were side effects of the "no space left on device" error.
We could see that /dev/sda4 was correctly partitioned to 120 GiB. We compared it to a working system and the partition scheme matched.
The filesystem, however, was only 2.8 GiB instead of 120 GiB.
We manually extended the filesystem for / (xfs_growfs /), after which the / mount was resized to 120 GiB.
The node was rebooted once this step was performed and the system came up fine with 4.12 Red Hat CoreOS.
We waited for a while for the node to come up with kubelet and crio running, approved the certs, and now the node is part of the cluster.
Later, while checking the logs for RCA, we observed the errors below, which might help in determining why the sysroot mountpoint was not resized.
~~~
$ grep -i growfs sos_commands/logs/journalctl_no-pager_-since_-3days
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Failed to load configuration: No such file or directory <---
Jun 12 10:37:30 ocp4-ftf8t-worker-2-wn6lp systemd[1]: ignition-ostree-growfs.service: Collecting.
~~~
Version-Release number of selected component (if applicable):
OCP 4.12.18.
IPI installation on RHV.
How reproducible:
Not able to reproduce the issue.
Steps to Reproduce:
1. 2. 3.
Actual results:
The /sysroot mountpoint was not resized to the actual size of the /dev/sda4 partition, which prevented machine-config-daemon-firstboot.service from completing, and the node was stuck at RHCOS version 4.6.
Currently, as a workaround, the customer has to manually resize the /sysroot mountpoint every time they add a new node to the cluster.
Expected results:
The /sysroot mountpoint should be automatically resized by the ignition-ostree-growfs.sh script.
Additional info:
The customer recently migrated from an old storage domain to a new one on RHV, in case that matters. However, they performed successful machineset scale-up tests with the new storage domain on OCP 4.11.33 (before upgrading OCP).
They started facing the issue with all machinesets (new and existing) only after they upgraded OCP to 4.12.18.
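The symptom in this report (a 120 GiB partition holding a 2.8 GiB filesystem) is easy to detect once both sizes are known. The sketch below only shows the comparison; how the two sizes are obtained is left out, and the function name and the 5% tolerance are assumptions chosen for illustration.

```python
GIB = 1024 ** 3


def filesystem_needs_grow(partition_bytes, filesystem_bytes, tolerance=0.05):
    """Return True if the filesystem leaves more than `tolerance` of the
    partition unused, which suggests the grow step did not run."""
    if partition_bytes <= 0:
        raise ValueError("partition_bytes must be positive")
    unused = partition_bytes - filesystem_bytes
    return unused / partition_bytes > tolerance


if __name__ == "__main__":
    # Sizes from the report: 120 GiB partition, 2.8 GiB filesystem.
    print(filesystem_needs_grow(120 * GIB, int(2.8 * GIB)))
```

A check like this on first boot would flag the node before image pulls start failing with "no space left on device".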
Please review the following PR: https://github.com/openshift/cluster-baremetal-operator/pull/362
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oauth-apiserver/pull/93
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/258
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: runtime zero namespaces ("default", "kube-system", "kube-public") are not excluded from pod security admission (PSA) in a HyperShift guest cluster.
In OCP, these runtime zero namespaces are excluded from PSA.
How reproducible: Always
Steps to Reproduce:
1. Install a fresh 4.14 hypershift cluster
2. Check the labels under the default, kube-system, and kube-public namespaces
3. Try to change the PSA value on these namespaces in the hypershift guest cluster; the values get updated.
Actual results:
$ oc get ns default -oyaml --kubeconfig=guest.kubeconfig
...
  labels:
    kubernetes.io/metadata.name: default
  name: default
...
$ oc label ns default pod-security.kubernetes.io/enforce=restricted --overwrite --kubeconfig=guest.kubeconfig
namespace/default labeled
$ oc get ns default -oyaml --kubeconfig=guest.kubeconfig
...
  labels:
    kubernetes.io/metadata.name: default
    pod-security.kubernetes.io/enforce: restricted
  name: default
Expected results:
Runtime zero namespaces ("default", "kube-system", "kube-public") are excluded from pod security admission
Additional info:
The kube-system namespace is excluded from PSA in the guest cluster, but an attempt to set the security.openshift.io/scc.podSecurityLabelSync value to true or false on it is not applied, whereas in the management cluster the podSecurityLabelSync value is updated.
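The check in the reproduction steps can be scripted. The following is a minimal sketch, not part of the original report: it assumes the namespace objects were exported as JSON (for example with `oc get ns -o json --kubeconfig=guest.kubeconfig`) and loaded into Python dictionaries. The function names are hypothetical. It only reports which runtime zero namespaces carry pod-security labels, which is the symptom shown in the actual results.

```python
# Label prefix used by Pod Security Admission on namespaces.
PSA_PREFIX = "pod-security.kubernetes.io/"

# Namespaces named in the report as "runtime zero" namespaces.
RUNTIME_ZERO_NAMESPACES = ("default", "kube-system", "kube-public")


def psa_labels(namespace_obj):
    """Return only the pod-security labels of one namespace object."""
    labels = namespace_obj.get("metadata", {}).get("labels") or {}
    return {k: v for k, v in labels.items() if k.startswith(PSA_PREFIX)}


def unexpected_psa_labels(namespaces):
    """Map each runtime zero namespace to the PSA labels found on it.

    Namespaces without PSA labels are omitted from the result.
    """
    found = {}
    for ns in namespaces:
        name = ns.get("metadata", {}).get("name")
        if name in RUNTIME_ZERO_NAMESPACES:
            labels = psa_labels(ns)
            if labels:
                found[name] = labels
    return found
```

An empty result would be consistent with the expected behaviour; a non-empty result matches the state shown after the `oc label` command.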
Description of problem:
An error message 'Restricted Access' and a 'Create Pod' button are shown on the Pods page for a normal user without any project
Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-03-04-063157
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP as a normal user and navigate to the Pods page
2. Check whether the 'No Pods found' message is shown on the page and the 'Create Pod' button is hidden
Actual results:
2. An error message 'Restricted Access' and an enabled 'Create Pod' button are shown on the Pods page
Expected results:
2. The page should show the 'No Pods found' message
and hide the 'Create Pod' button
Additional info:
The 'Deployment, Stateful Set, Job, Service' pages already behave correctly and can be used for comparison
Description of problem:
OKD installer attempts to enable systemd-journal-gatewayd.socket, which is not present on FCOS
Version-Release number of selected component (if applicable):
4.13
Description of problem:
To make AWS Load Balancer Operator work on HyperShift, one of the requirements is that the ELB tag is set on subnets (see https://github.com/openshift/aws-load-balancer-operator/blob/main/docs/prerequisites.md#vpc-and-subnets ). The value of `kubernetes.io/role/elb` or `kubernetes.io/role/internal-elb` should be 1 or ``. However, in the code below, hypershift uses "true": https://github.com/openshift/hypershift/blob/3e1db35d562d069797f9dec2b47227744f689684/cmd/infra/aws/ec2.go#L226
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. install hypershift cluster
2. check subnet tags
Actual results:
value of `kubernetes.io/role/elb` is "true"
Expected results:
value of `kubernetes.io/role/elb` is 1 or ``
Additional info:
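The expected result above can be expressed as a small check. This is an illustrative sketch only: it assumes the subnet tags have already been fetched into a plain dictionary of tag key to tag value, and the function name is hypothetical. The accepted values (1 or the empty string) are the ones stated in the report.

```python
# Role tags named in the report.
ELB_ROLE_TAGS = ("kubernetes.io/role/elb", "kubernetes.io/role/internal-elb")

# Values the report lists as acceptable: 1 or an empty string.
ACCEPTED_VALUES = ("1", "")


def invalid_elb_role_tags(subnet_tags):
    """Return the ELB role tags whose value is neither "1" nor empty."""
    return {
        key: value
        for key, value in subnet_tags.items()
        if key in ELB_ROLE_TAGS and value not in ACCEPTED_VALUES
    }
```

For the subnet described in the actual results, `invalid_elb_role_tags({"kubernetes.io/role/elb": "true"})` returns the offending tag, while a subnet tagged with `1` yields an empty dictionary.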
See log:
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 422 [running]:
runtime/debug.Stack()
runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000521140, {0x2d0b2ef, 0x14})
sigs.k8s.io/controller-runtime@v0.15.0/pkg/log/deleg.go:147 +0x4c
github.com/go-logr/logr.Logger.WithName({{0x31b3e78, 0xc000521140}, 0x0}, {0x2d0b2ef?, 0x40?})
github.com/go-logr/logr@v1.2.4/logr.go:336 +0x46
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc000471440, {0x0, 0x0, {0x31b5c00, 0xc000eb3100}, 0x0, {0x0, 0x0}, 0x0})
sigs.k8s.io/controller-runtime@v0.15.0/pkg/client/client.go:115 +0xb4
sigs.k8s.io/controller-runtime/pkg/client.New(0x319b2b0?, {0x0, 0x0,, 0x0, {0x0, 0x0}, 0x0})
sigs.k8s.io/controller-runtime@v0.15.0/pkg/client/client.go:101 +0x85
github.com/openshift/cluster-network-operator/pkg/client.NewClusterClient(0xc000471440, 0xc000499b00)
github.com/openshift/cluster-network-operator/pkg/client/client.go:188 +0x2b0
github.com/openshift/cluster-network-operator/pkg/client.NewClient(0x0?, 0x0?, {0x2cecdf7, 0x7}, 0x0?)
github.com/openshift/cluster-network-operator/pkg/client/client.go:100 +0xa5
github.com/openshift/cluster-network-operator/pkg/operator.RunOperator({0x31ace70, 0xc0009a0b90}, 0xc000318a40, {0x2cecdf7, 0x7}, 0x0?)
github.com/openshift/cluster-network-operator/pkg/operator/operator.go:46 +0xbd
main.newNetworkOperatorCommand.func2({0x31ace70?, 0xc0009a0b90?}, 0x31acee0?)
github.com/openshift/cluster-network-operator/cmd/cluster-network-operator/main.go:49 +0x3b
github.com/openshift/library-go/pkg/controller/controllercmd.ControllerBuilder.getOnStartedLeadingFunc.func1.1()
github.com/openshift/library-go@v0.0.0-20230503144409-4cb26a344c37/pkg/controller/controllercmd/builder.go:351 +0x74
created by github.com/openshift/library-go/pkg/controller/controllercmd.ControllerBuilder.getOnStartedLeadingFunc.func1
github.com/openshift/library-go@v0.0.0-20230503144409-4cb26a344c37/pkg/controller/controllercmd/builder.go:349 +0x10a
I think we want these logs to be displayed.
This is a clone of issue OCPBUGS-29114. The following is the description of the original issue:
—
Description of problem:
When installing a new vSphere cluster with static IPs, control plane machine sets (CPMS) are also enabled in TechPreviewNoUpgrade and the installer applies the incorrect config to the CPMS resulting in masters being recreated.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. create install-config.yaml with static IPs following documentation
2. run `openshift-install create cluster`
3. as install progresses, watch the machines definitions
Actual results:
new master machines are created
Expected results:
all machines are the same as what was created by the installer.
Additional info:
Tracker issue for bootimage bump in 4.15. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-18945.
Please review the following PR: https://github.com/openshift/platform-operators/pull/91
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/485
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/32
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seeing failures for SDN periodics running [sig-network][Feature:tuning] sysctl allowlist update should start a pod with custom sysctl only when the sysctl is added to whitelist [Suite:openshift/conformance/parallel] beginning with 4.16.0-0.nightly-2024-01-05-205447
Jan 5 23:14:22.066: INFO: At 2024-01-05 23:14:09 +0000 UTC - event for testpod: {kubelet ip-10-0-54-42.us-west-2.compute.internal} FailedCreatePodSandBox: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_testpod_e2e-test-tuning-bzspr_2a9ce6e0-726d-47a6-ac64-71d430926574_0(968a55c5afd81e077b1d15a4129084d5f15002ac3ae6aa9fe32648e841940fe2): error adding pod e2e-test-tuning-bzspr_testpod to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): timed out waiting for the condition
That payload contains OCPBUGS-26222 (Adds a wait on unix socket readiness). It is not certain that this change is the cause, but it will be investigated.
This is a clone of issue OCPBUGS-28251. The following is the description of the original issue:
—
Description of problem:
Trying to define multiple receivers in a single user-defined AlertmanagerConfig
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
#### Monitoring for user-defined projects is enabled
```
oc -n openshift-monitoring get configmap cluster-monitoring-config -o yaml | head -4
```
```
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
```
#### separate Alertmanager instance for user-defined alert routing is Enabled and Configured
```
oc -n openshift-user-workload-monitoring get configmap user-workload-monitoring-config -o yaml | head -6
```
```
apiVersion: v1
data:
  config.yaml: |
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
```
create testing namespace
```
oc new-project libor-alertmanager-testing
```
## TESTING - MULTIPLE RECEIVERS IN ALERTMANAGERCONFIG
Single AlertmanagerConfig `alertmanager_config_webhook_and_email_rootDefault.yaml`
```
apiVersion: monitoring.coreos.com/v1beta1
kind: AlertmanagerConfig
metadata:
  name: libor-alertmanager-testing-email-webhook
  namespace: libor-alertmanager-testing
spec:
  receivers:
  - name: 'libor-alertmanager-testing-webhook'
    webhookConfigs:
    - url: 'http://prometheus-msteams.internal-monitoring.svc:2000/occ-alerts'
  - name: 'libor-alertmanager-testing-email'
    emailConfigs:
    - to: USER@USER.CO
      requireTLS: false
      sendResolved: true
  - name: Default
  route:
    groupBy:
    - namespace
    receiver: Default
    groupInterval: 60s
    groupWait: 60s
    repeatInterval: 12h
    routes:
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      continue: true
      receiver: 'libor-alertmanager-testing-webhook'
    - matchers:
      - name: severity
        value: critical
        matchType: '='
      receiver: 'libor-alertmanager-testing-email'
```
Once saved the continue statement is removed from the object.
the configuration applied to alertmanager contains continue false statements
```
oc exec -n openshift-user-workload-monitoring alertmanager-user-workload-0 -- amtool config show --alertmanager.url http://localhost:9093
```
```
route:
  receiver: Default
  group_by:
  - namespace
  continue: false
  routes:
  - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/Default
    group_by:
    - namespace
    matchers:
    - namespace="libor-alertmanager-testing"
    continue: true
    routes:
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-webhook
      matchers:
      - severity="critical"
      continue: false <----
    - receiver: libor-alertmanager-testing/libor-alertmanager-testing-email-webhook/libor-alertmanager-testing-email
      matchers:
      - severity="critical"
      continue: false <-----
```
If I update the statements to read `continue: true` and test here: https://prometheus.io/webtools/alerting/routing-tree-editor/ then I get the desired results
workaround is to use 2 separate files - the continue statement is being added.
Actual results:
Once saved, the continue statement is removed from the object.
Expected results:
The `continue: true` statement is retained and applied to Alertmanager
Additional info:
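To see why the dropped `continue: true` matters, the sibling-route behaviour can be modelled in a few lines. This is a simplified sketch of Alertmanager's routing semantics (matching child routes are tried in order, and evaluation of siblings stops at the first match unless that route sets `continue`), not the Alertmanager implementation. The dictionary layout, equality-only matching, and the shortened receiver names `webhook` and `email` are assumptions made for illustration.

```python
def matches(route, labels):
    """A route matches when every label in its 'match' map equals the alert's."""
    return all(labels.get(k) == v for k, v in route.get("match", {}).items())


def receivers_for(route, labels):
    """Return the receivers notified for an alert, walking the tree depth-first."""
    if not matches(route, labels):
        return []
    hits = []
    for child in route.get("routes", []):
        child_hits = receivers_for(child, labels)
        hits.extend(child_hits)
        # A matching child without continue stops evaluation of later siblings.
        if child_hits and not child.get("continue", False):
            break
    # If no child matched, the current route's own receiver is used.
    return hits or [route["receiver"]]


def example_tree(first_continue):
    """Two sibling routes matching severity=critical, as in the report."""
    return {
        "receiver": "Default",
        "routes": [
            {"receiver": "webhook", "match": {"severity": "critical"},
             "continue": first_continue},
            {"receiver": "email", "match": {"severity": "critical"}},
        ],
    }
```

With `example_tree(False)`, a critical alert reaches only `webhook`, which corresponds to the configuration shown by `amtool`. With `example_tree(True)`, it reaches both `webhook` and `email`, which is the result the reporter expected.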
This is a clone of issue OCPBUGS-25708. The following is the description of the original issue:
—
Changes made for faster risk cache-warming (the OCPBUGS-19512 series) introduced an unfortunate cycle:
1. Cincinnati serves vulnerable PromQL, like graph-data#4524.
2. Clusters pick up that broken PromQL, try to evaluate, and fail. Re-eval-and-fail loop continues.
3. Cincinnati PromQL fixed, like graph-data#4528.
4. Cases:
The regression went back via:
Updates from those releases (and later in their 4.y, until this bug lands a fix) to later releases are exposed.
Likely very reproducible for exposed releases, but only when clusters are served PromQL risks that will consistently fail evaluation.
1. Launch a cluster.
2. Point it at dummy Cincinnati data, as described in OTA-520. Initially declare a risk with broken PromQL in that data, like cluster_operator_conditions.
3. Wait until the cluster is reporting Recommended=Unknown for those risks (oc adm upgrade --include-not-recommended).
4. Update the risk to working PromQL, like group(cluster_operator_conditions). Alternatively, update anything about the update-service data (e.g. adding a new update target with a path from the cluster's version).
5. Wait 10 minutes for the CVO to have plenty of time to pull that new Cincinnati data.
6. oc get -o json clusterversion version | jq '.status.conditionalUpdates[].risks[].matchingRules[].promql.promql' | sort | uniq | jq -r .
Exposed releases will still have the broken PromQL in their output (or will lack the new update target you added, or whatever the Cincinnati data change was).
Fixed releases will have picked up the fixed PromQL in their output (or will have the new update target you added, or whatever the Cincinnati data change was).
To detect exposure in collected Insights, look for EvaluationFailed conditionalUpdates like:
$ oc get -o json clusterversion version | jq -r '.status.conditionalUpdates[].conditions[] | select(.type == "Recommended" and .status == "Unknown" and .reason == "EvaluationFailed" and (.message | contains("invalid PromQL")))'
{
  "lastTransitionTime": "2023-12-15T22:00:45Z",
  "message": "Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34\nAdding a new worker node will fail for clusters running on ARO. https://issues.redhat.com/browse/MCO-958",
  "reason": "EvaluationFailed",
  "status": "Unknown",
  "type": "Recommended"
}
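For collected data that is processed outside of `jq`, the same selection can be written in Python. This is a sketch under the assumption that the ClusterVersion object has been loaded from JSON into a dictionary; the function name is hypothetical, and the conditions it tests are exactly those in the filter above.

```python
def failed_promql_conditions(cluster_version):
    """Return Recommended=Unknown conditions caused by invalid PromQL.

    Mirrors the jq filter: type Recommended, status Unknown,
    reason EvaluationFailed, and a message containing "invalid PromQL".
    """
    hits = []
    status = cluster_version.get("status") or {}
    for update in status.get("conditionalUpdates") or []:
        for cond in update.get("conditions") or []:
            if (
                cond.get("type") == "Recommended"
                and cond.get("status") == "Unknown"
                and cond.get("reason") == "EvaluationFailed"
                and "invalid PromQL" in cond.get("message", "")
            ):
                hits.append(cond)
    return hits
```

A non-empty return value indicates the same exposure signal as a non-empty `jq` result.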
To confirm in-cluster vs. other EvaluationFailed invalid PromQL issues, you can look for Cincinnati retrieval attempts in CVO logs. Example from a healthy cluster:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:36:39.783530 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:36:39.831358 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
I1221 20:40:19.674925 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:40:19.727998 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
I1221 20:43:59.567369 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:43:59.620315 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
I1221 20:47:39.457582 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:47:39.509505 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
I1221 20:51:19.348286 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://api.openshift.com/api/upgrades_info/v1/graph?...
I1221 20:51:19.401496 1 promql.go:118] evaluate PromQL cluster condition: "(\n group(cluster_operator_conditions{name=\"aro\"})\n or\n 0 * group(cluster_operator_conditions)\n)\n"
showing fetch lines every few minutes. And from an exposed cluster, only showing PromQL eval lines:
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from\|PromQL' | tail
I1221 20:50:10.165101 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:11.166170 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:12.166314 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:13.166517 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:14.166847 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:15.167737 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:16.168486 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:17.169417 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:18.169576 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
I1221 20:50:19.170544 1 availableupdates.go:123] Requeue available-update evaluation, because "4.13.25" is Recommended=Unknown: EvaluationFailed: Exposure to AROBrokenDNSMasq is unknown due to an evaluation failure: invalid PromQL result length must be one, but is 34
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 --since 30m | grep 'request updates from' | tail
...no hits...
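The comparison between the healthy and the exposed log excerpts can be automated with a rough heuristic. The sketch below is an assumption-laden illustration, not a supported diagnostic: it takes CVO log lines already collected for a recent window (such as the 30 minutes used above) and looks for the same two substrings as the `grep` patterns. The function name and the returned labels are hypothetical.

```python
def classify_cvo_log(lines):
    """Classify a window of CVO log lines using the two grep patterns.

    "fetching": at least one Cincinnati retrieval attempt was logged.
    "possibly exposed": PromQL evaluation lines but no retrieval attempts.
    "inconclusive": neither pattern appears in the window.
    """
    fetches = sum("request updates from" in line for line in lines)
    evaluations = sum("PromQL" in line for line in lines)
    if fetches:
        return "fetching"
    if evaluations:
        return "possibly exposed"
    return "inconclusive"
```

A window that is too short may miss the fetch lines of a healthy cluster, since those appear only every few minutes, so a "possibly exposed" result is a prompt for closer inspection rather than a confirmation.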
If bitten, the remediation is to address the invalid PromQL. For example, we fixed that AROBrokenDNSMasq expression in graph-data#4528. After that, the local cluster administrator should restart their CVO, such as with:
$ oc -n openshift-cluster-version delete -l k8s-app=cluster-version-operator pods
The final iteration (of 3) of the fix for OCPBUGS-4248 - https://github.com/openshift/cluster-baremetal-operator/pull/341 - uses the (IPv6) API VIP as the IP address for IPv6 BMCs to contact Apache to download the image to mount via virtualmedia.
Since Apache runs as part of the metal3 Deployment, it exists on only one node. There is no guarantee that the API VIP will land (or stay) on the same node, so this fails to work more often than not. Kube-proxy does not do anything to redirect traffic to pods with host networking enabled, such as the metal3 Deployment.
The IPv6 is passed to the baremetal-operator. This has been split into its own Deployment since the first iteration of OCPBUGS-4228, in which we collected the IP address of the host from the deployed metal3 Pod. At the time that caused a circular dependency of the Deployment on its own Pod, but this would no longer be the case. However, a backport beyond 4.14 would require the Deployment split to also be backported.
Alternatively, ironic-proxy could be adapted to also proxy the images produced by ironic. This would be new functionality that would also need to be backported.
Finally, we could determine the host IP from inside the baremetal-operator container instead of from cluster-baremetal-operator. However, this approach has not been tried and would only work in backports because it relies on baremetal-operator continuing to run within the same Pod as ironic.
Newly added OpenShift Container Platform 4 nodes were found to be missing certain settings applied via tuned. Investigation showed that it can take 30 minutes or more for the tuned profile of a newly added OpenShift Container Platform 4 node to be created.
When increasing the log level of cluster-node-tuning-operator pod we can see the following events being recorded.
I1128 13:05:12.465193 1 controller.go:1121] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (add) I1128 13:05:12.465235 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:12.465247 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:12.465255 1 profilecalculator.go:131] Node's new-worker-X.example.com providerID=aws:///eu-central-1c/i-0874090641dd61eef I1128 13:05:12.465268 1 controller.go:300] sync(): Node new-worker-X.example.com label(s) changed I1128 13:05:12.465288 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:12.486200 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:12.486233 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:12.486242 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:12.486256 1 controller.go:300] sync(): Node new-worker-X.example.com label(s) changed I1128 13:05:12.486273 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:12.612063 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:12.612114 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:12.612127 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:12.612149 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:15.232435 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:15.232477 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:15.232541 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:15.232565 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:22.805108 1 
controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:22.805142 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:22.805151 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:22.805170 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:30.803481 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:30.803511 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:30.803519 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:30.803533 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:35.815894 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:35.815933 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:35.815942 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:35.815958 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:35.832338 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:35.832386 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:35.832395 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:35.832419 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:35.851291 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:35.851337 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:35.851349 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:35.851369 1 controller.go:209] event from workqueue 
(node//new-worker-X.example.com) successfully processed I1128 13:05:40.855159 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:40.855192 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:40.855201 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:40.855221 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:48.004741 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:48.004783 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:48.004815 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:48.004835 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:48.011986 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:48.012035 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:48.012047 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:48.012067 1 controller.go:300] sync(): Node new-worker-X.example.com label(s) changed I1128 13:05:48.012090 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:53.475798 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:53.475842 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:53.475855 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:53.475876 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:56.097269 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:56.097299 1 controller.go:221] sync(): Kind node: 
/new-worker-X.example.com I1128 13:05:56.097309 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:56.097329 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:05:58.497782 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:05:58.497838 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:05:58.497847 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:05:58.497864 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:06.117201 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:06:06.117235 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:06.117254 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:06.117271 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:08.008992 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:06:08.009031 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:08.009041 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:08.009059 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:09.685949 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:06:09.685988 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:09.685997 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:09.686015 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:11.163882 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) 
I1128 13:06:11.163929 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:11.163941 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:11.163965 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:19.730972 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:06:19.731005 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:19.731013 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:19.731028 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:06:23.713627 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:06:23.713665 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:06:23.713675 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:06:23.713693 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:07:52.133190 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:07:52.133227 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:07:52.133235 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:07:52.133268 1 controller.go:300] sync(): Node new-worker-X.example.com label(s) changed I1128 13:07:52.133285 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:07:55.779247 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:07:55.779278 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:07:55.779286 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:07:55.779324 1 controller.go:209] event from workqueue 
(node//new-worker-X.example.com) successfully processed I1128 13:07:55.799941 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:07:55.799975 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:07:55.799983 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:07:55.800021 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:07:56.062048 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:07:56.062081 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:07:56.062089 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:07:56.062126 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:09:58.224261 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:09:58.224294 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:09:58.224303 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:09:58.224333 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:10:08.146467 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:10:08.146504 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:10:08.146513 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:10:08.146549 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:10:29.293368 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:10:29.293402 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:10:29.293410 1 controller.go:282] sync(): Node new-worker-X.example.com 
I1128 13:10:29.293440 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:11:38.765691 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:11:38.781424 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:11:38.781432 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:11:38.781471 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:15:35.022263 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:15:35.022303 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:15:35.022312 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:15:35.022349 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:20:41.252897 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:20:41.252942 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:20:41.252951 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:20:41.252988 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:21:38.768157 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:21:38.781098 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:21:38.781103 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:21:38.781133 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:25:47.684402 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:25:47.684445 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 
13:25:47.684457 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:25:47.684494 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:25:53.336668 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:25:53.336700 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:25:53.336709 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:25:53.336738 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:25:57.754420 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:25:57.754453 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:25:57.754462 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:25:57.754491 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:26:03.987123 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:26:03.987188 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:26:03.987203 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:26:03.987258 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:26:38.231524 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:26:38.231558 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:26:38.231566 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:26:38.231602 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:27:08.845310 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:27:08.845349 1 
controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:27:08.845358 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:27:08.845398 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:27:49.797881 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:27:49.797919 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:27:49.797928 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:27:49.797958 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:27:49.856526 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:27:49.856566 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:27:49.856575 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:27:49.856612 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:27:49.904286 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:27:49.904341 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:27:49.904350 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:27:49.904400 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:30:02.351363 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:30:02.351398 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:30:02.351407 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:30:02.351440 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:30:03.719303 1 controller.go:1136] add event to workqueue due to *v1.Node, 
Name=new-worker-X.example.com (update) I1128 13:30:03.719338 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:30:03.719347 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:30:03.719380 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:30:33.316267 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:30:33.316297 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:30:33.316307 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:30:33.316336 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:30:33.330998 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:30:33.331030 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:30:33.331038 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:30:33.331066 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:31:31.688121 1 controller.go:221] sync(): Kind profile: openshift-cluster-node-tuning-operator/new-worker-X.example.com I1128 13:31:31.688136 1 controller.go:374] sync(): Profile new-worker-X.example.com I1128 13:31:31.688300 1 profilecalculator.go:164] calculateProfile(new-worker-X.example.com) I1128 13:31:31.688337 1 controller.go:677] syncProfile(): Profile new-worker-X.example.com not found, creating one [openshift-node] I1128 13:31:31.688396 1 request.go:1073] Request Body: 
{"kind":"Profile","apiVersion":"tuned.openshift.io/v1","metadata":{"name":"new-worker-X.example.com","namespace":"openshift-cluster-node-tuning-operator","creationTimestamp":null,"ownerReferences":[{"apiVersion":"tuned.openshift.io/v1","kind":"Tuned","name":"default","uid":"324f82ad-4475-4b49-ac29-57cb454314e7","controller":true,"blockOwnerDeletion":true}]},"spec":{"config":{"tunedProfile":"openshift-node","debug":false,"tunedConfig":{"reapply_sysctl":null}}},"status":{"bootcmdline":"","tunedProfile":"","conditions":[{"type":"Applied","status":"Unknown","lastTransitionTime":"2023-11-28T13:31:31Z"},{"type":"Degraded","status":"Unknown","lastTransitionTime":"2023-11-28T13:31:31Z"}]}} I1128 13:31:31.698807 1 request.go:1073] Response Body: {"apiVersion":"tuned.openshift.io/v1","kind":"Profile","metadata":{"creationTimestamp":"2023-11-28T13:31:31Z","generation":1,"managedFields":[{"apiVersion":"tuned.openshift.io/v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"324f82ad-4475-4b49-ac29-57cb454314e7\"}":{}}},"f:spec":{".":{},"f:config":{".":{},"f:debug":{},"f:tunedConfig":{},"f:tunedProfile":{}}}},"manager":"cluster-node-tuning-operator","operation":"Update","time":"2023-11-28T13:31:31Z"}],"name":"new-worker-X.example.com","namespace":"openshift-cluster-node-tuning-operator","ownerReferences":[{"apiVersion":"tuned.openshift.io/v1","blockOwnerDeletion":true,"controller":true,"kind":"Tuned","name":"default","uid":"324f82ad-4475-4b49-ac29-57cb454314e7"}],"resourceVersion":"9673729653","uid":"8607cf52-9a00-49d2-baff-8a97c73b809a"},"spec":{"config":{"debug":false,"tunedConfig":{},"tunedProfile":"openshift-node"}}} I1128 13:31:31.698915 1 controller.go:687] created profile new-worker-X.example.com [openshift-node] I1128 13:31:31.698925 1 controller.go:209] event from workqueue (profile/openshift-cluster-node-tuning-operator/new-worker-X.example.com) successfully processed I1128 13:31:31.702309 1 controller.go:1121] add event to 
workqueue due to *v1.Profile, Namespace=openshift-cluster-node-tuning-operator, Name=new-worker-X.example.com (add) I1128 13:31:31.702335 1 controller.go:221] sync(): Kind profile: openshift-cluster-node-tuning-operator/new-worker-X.example.com I1128 13:31:31.702358 1 controller.go:374] sync(): Profile new-worker-X.example.com I1128 13:31:31.702494 1 profilecalculator.go:164] calculateProfile(new-worker-X.example.com) I1128 13:31:31.713444 1 controller.go:752] syncProfile(): updating Profile new-worker-X.example.com [openshift-node] I1128 13:31:31.713543 1 request.go:1073] Request Body: {"kind":"Profile","apiVersion":"tuned.openshift.io/v1","metadata":{"name":"new-worker-X.example.com","namespace":"openshift-cluster-node-tuning-operator","uid":"8607cf52-9a00-49d2-baff-8a97c73b809a","resourceVersion":"9673729653","generation":1,"creationTimestamp":"2023-11-28T13:31:31Z","ownerReferences":[{"apiVersion":"tuned.openshift.io/v1","kind":"Tuned","name":"default","uid":"324f82ad-4475-4b49-ac29-57cb454314e7","controller":true,"blockOwnerDeletion":true}],"managedFields":[{"manager":"cluster-node-tuning-operator","operation":"Update","apiVersion":"tuned.openshift.io/v1","time":"2023-11-28T13:31:31Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"324f82ad-4475-4b49-ac29-57cb454314e7\"}":{}}},"f:spec":{".":{},"f:config":{".":{},"f:debug":{},"f:tunedConfig":{},"f:tunedProfile":{}}}}}]},"spec":{"config":{"tunedProfile":"openshift-node","debug":false,"tunedConfig":{"reapply_sysctl":null},"providerName":"aws"}},"status":{"bootcmdline":"","tunedProfile":"","conditions":[{"type":"Applied","status":"Unknown","lastTransitionTime":"2023-11-28T13:31:31Z"},{"type":"Degraded","status":"Unknown","lastTransitionTime":"2023-11-28T13:31:31Z"}]}} I1128 13:31:31.713611 1 round_trippers.go:466] curl -v -XPUT -H "User-Agent: cluster-node-tuning-operator/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Accept: application/json, */*" -H "Authorization: 
Bearer <masked>" -H "Content-Type: application/json" 'https://172.16.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles/new-worker-X.example.com' I1128 13:31:31.720708 1 round_trippers.go:553] PUT https://172.16.0.1:443/apis/tuned.openshift.io/v1/namespaces/openshift-cluster-node-tuning-operator/profiles/new-worker-X.example.com 200 OK in 7 milliseconds I1128 13:31:31.720855 1 request.go:1073] Response Body: {"apiVersion":"tuned.openshift.io/v1","kind":"Profile","metadata":{"creationTimestamp":"2023-11-28T13:31:31Z","generation":2,"managedFields":[{"apiVersion":"tuned.openshift.io/v1","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:ownerReferences":{".":{},"k:{\"uid\":\"324f82ad-4475-4b49-ac29-57cb454314e7\"}":{}}},"f:spec":{".":{},"f:config":{".":{},"f:debug":{},"f:providerName":{},"f:tunedConfig":{},"f:tunedProfile":{}}}},"manager":"cluster-node-tuning-operator","operation":"Update","time":"2023-11-28T13:31:31Z"}],"name":"new-worker-X.example.com","namespace":"openshift-cluster-node-tuning-operator","ownerReferences":[{"apiVersion":"tuned.openshift.io/v1","blockOwnerDeletion":true,"controller":true,"kind":"Tuned","name":"default","uid":"324f82ad-4475-4b49-ac29-57cb454314e7"}],"resourceVersion":"9673729659","uid":"8607cf52-9a00-49d2-baff-8a97c73b809a"},"spec":{"config":{"debug":false,"providerName":"aws","tunedConfig":{},"tunedProfile":"openshift-node"}}} I1128 13:31:31.720946 1 controller.go:757] updated profile new-worker-X.example.com [openshift-node] I1128 13:31:31.720955 1 controller.go:209] event from workqueue (profile/openshift-cluster-node-tuning-operator/new-worker-X.example.com) successfully processed I1128 13:31:31.721160 1 controller.go:1136] add event to workqueue due to *v1.Profile, Namespace=openshift-cluster-node-tuning-operator, Name=new-worker-X.example.com (update) I1128 13:31:31.724833 1 controller.go:221] sync(): Kind profile: openshift-cluster-node-tuning-operator/new-worker-X.example.com I1128 
13:31:31.724847 1 controller.go:374] sync(): Profile new-worker-X.example.com I1128 13:31:31.724971 1 profilecalculator.go:164] calculateProfile(new-worker-X.example.com) I1128 13:31:31.726987 1 controller.go:742] syncProfile(): no need to update Profile new-worker-X.example.com I1128 13:31:31.726993 1 controller.go:209] event from workqueue (profile/openshift-cluster-node-tuning-operator/new-worker-X.example.com) successfully processed I1128 13:31:32.273200 1 controller.go:1136] add event to workqueue due to *v1.Profile, Namespace=openshift-cluster-node-tuning-operator, Name=new-worker-X.example.com (update) I1128 13:31:32.273234 1 controller.go:221] sync(): Kind profile: openshift-cluster-node-tuning-operator/new-worker-X.example.com I1128 13:31:32.273246 1 controller.go:374] sync(): Profile new-worker-X.example.com I1128 13:31:32.273410 1 profilecalculator.go:164] calculateProfile(new-worker-X.example.com) I1128 13:31:32.284388 1 controller.go:742] syncProfile(): no need to update Profile new-worker-X.example.com I1128 13:31:32.284400 1 controller.go:209] event from workqueue (profile/openshift-cluster-node-tuning-operator/new-worker-X.example.com) successfully processed I1128 13:31:38.766803 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:31:38.769582 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:31:38.769588 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:31:38.769617 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) successfully processed I1128 13:35:39.839137 1 controller.go:1136] add event to workqueue due to *v1.Node, Name=new-worker-X.example.com (update) I1128 13:35:39.839174 1 controller.go:221] sync(): Kind node: /new-worker-X.example.com I1128 13:35:39.839182 1 controller.go:282] sync(): Node new-worker-X.example.com I1128 13:35:39.839215 1 controller.go:209] event from workqueue (node//new-worker-X.example.com) 
successfully processed
So the OpenShift Container Platform 4 node `new-worker-X.example.com` did become available at 13:05:12, but the tuned profile was not created until 13:31:31, and only then were the required settings applied to the node.
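The size of that delay can be read off the two timestamps. The following is a minimal sketch, assuming both timestamps fall on the same day; the helper name `gap_seconds` is hypothetical and not part of any operator code.

```python
from datetime import datetime


def gap_seconds(start: str, end: str) -> int:
    """Return the whole seconds between two HH:MM:SS timestamps on the same day."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


node_available = "13:05:12"   # node became available (from the description)
profile_created = "13:31:31"  # tuned Profile created (from the operator log)

delay = gap_seconds(node_available, profile_created)
print(f"{delay // 60} min {delay % 60} s")  # 26 min 19 s
```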
Description of problem:
The original PR has already been merged, so this new bug tracks the remaining issues in the docs.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Sheet format issue for 'useActiveColumns', 'K8sGetResource', 'k8sDeleteResource', 'k8sListResource', 'K8sUpdateResource' and 'k8sPatchResource'. Attached: https://drive.google.com/file/d/1NgitSi9mgB3zluqmp8eza4DhFOVY-Pt9/view?usp=drive_link
2. The text 'code' is not highlighted in 'getGroupVersionKindForModel'. Attached: https://drive.google.com/file/d/1sVxXdlIBxKxxokZX2iorJOER7ILGByzm/view?usp=drive_link
3. Incorrect </br> setting in 'ErrorBoundaryFallbackPage': https://drive.google.com/file/d/1ubhcFb68kDwL-wKsknP1Hb0fos480OnA/view?usp=drive_link
4. Several links marked with label {@link}: ListPageCreate, useK8sModel, k8sGetResource, k8sDeleteResource, k8sListResource, k8sListResourceItems, YAMLEditor
Actual results:
Expected results:
Additional info:
Impacted Code Lines:
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L616-L619
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1277-L1283
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1404-L1410
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1434-L1437
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1335-L1341
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1365-L1370
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1528
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L2157
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L698
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1035
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L1452
https://github.com/Mylanos/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#L2480
Description of problem:
I noticed this in the logs at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-dns-operator/373/pull-ci-openshift-cluster-dns-operator-master-e2e-aws-ovn-operator/1704287854600916992/build-log.txt:
=== RUN TestCoreDNSDaemonSetReconciliation
[controller-runtime] log.SetLogger(...) was never called, logs will not be displayed:
goroutine 205 [running]:
runtime/debug.Stack()
    /usr/lib/golang/src/runtime/debug/stack.go:24 +0x65
sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
    /go/src/github.com/openshift/cluster-dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:59 +0xbd
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0xc000061340, {0x182213b, 0x14})
    /go/src/github.com/openshift/cluster-dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x4c
github.com/go-logr/logr.Logger.WithName({{0x1aa8468, 0xc000061340}, 0x0}, {0x182213b?, 0x0?})
    /go/src/github.com/openshift/cluster-dns-operator/vendor/github.com/go-logr/logr/logr.go:336 +0x46
sigs.k8s.io/controller-runtime/pkg/client.newClient(0xc000789200, {0x0, 0xc0001a7730, {0x1aa9d90, 0xc00011c700}, 0x0, {0x0, 0x0}, 0x0})
    /go/src/github.com/openshift/cluster-dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:115 +0xb4
sigs.k8s.io/controller-runtime/pkg/client.New(0xc000789200?, {0x0, 0xc0001a7730, {0x1aa9d90, 0xc00011c700}, 0x0, {0x0, 0x0}, 0x0})
    /go/src/github.com/openshift/cluster-dns-operator/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:101 +0x85
github.com/openshift/cluster-dns-operator/pkg/operator/client.NewClient(0x0?)
    /go/src/github.com/openshift/cluster-dns-operator/pkg/operator/client/client.go:52 +0x145
github.com/openshift/cluster-dns-operator/test/e2e.getClient()
    /go/src/github.com/openshift/cluster-dns-operator/test/e2e/utils.go:451 +0x77
github.com/openshift/cluster-dns-operator/test/e2e.TestCoreDNSDaemonSetReconciliation(0xc000501520)
    /go/src/github.com/openshift/cluster-dns-operator/test/e2e/operator_test.go:330 +0x45
testing.tRunner(0xc000501520, 0x193c038)
    /usr/lib/golang/src/testing/testing.go:1576 +0x10b
created by testing.(*T).Run
    /usr/lib/golang/src/testing/testing.go:1629 +0x3ea
operator_test.go:374: found "foo" node selector on daemonset openshift-dns/dns-default: <nil>
operator_test.go:378: observed absence of "foo" node selector on daemonset openshift-dns/dns-default: <nil>
--- PASS: TestCoreDNSDaemonSetReconciliation (1.63s)
We need to make a minor change in https://github.com/openshift/cluster-dns-operator/blob/7d2a16c0abf80d09fdcbeef8464994b78aa0589d/test/e2e/operator_test.go#L374-L375
Version-Release number of selected component (if applicable):
4.15 and earlier
How reproducible:
Be unlucky in CI testing
Steps to Reproduce:
Actual results:
A stack trace is printed, and the test log messages end with a stray <nil>:
operator_test.go:374: found "foo" node selector on daemonset openshift-dns/dns-default: <nil>
operator_test.go:378: observed absence of "foo" node selector on daemonset openshift-dns/dns-default: <nil>
Expected results:
No stack trace and no print of <nil>
Additional info:
Description of problem:
The downstream Ironic image lacks the configuration option that was added upstream.
This is a clone of issue OCPBUGS-23518. The following is the description of the original issue:
—
When upgrading a HC from 4.13 to 4.14, after admin-acking the API deprecation check, the upgrade is still blocked because the ClusterVersionUpgradeable condition on the HC is Unknown. This is because the CVO in the guest cluster no longer has an Upgradeable condition.
Running the command coreos-installer iso kargs show no longer works with the 4.13 Agent ISO. Instead we get this error:
$ coreos-installer iso kargs show agent.x86_64.iso
Writing manifest to image destination
Storing signatures
Error: No karg embed areas found; old or corrupted CoreOS ISO image.
This is almost certainly due to the way we repack the ISO as part of embedding the agent-tui binary in it.
It worked fine in 4.12. I have tested both ISOs with every version of coreos-installer from 0.14 to 0.17.
RHOCP installation on RHOSP fails with an error
~~~
$ ansible-playbook -i inventory.yaml security-groups.yaml
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Incompatible openstacksdk library found: Version MUST be >=1.0 and <=None, but 0.36.5 is smaller than minimum version 1.0."}
~~~
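The failure above is a plain version-range check: the installed openstacksdk 0.36.5 is below the required minimum of 1.0, and no maximum is set. The following is a minimal sketch of such a check, assuming purely numeric dotted versions. The function names are hypothetical and this is not the Ansible module's actual implementation.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as '0.36.5' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def is_compatible(installed: str, minimum: str, maximum: Optional[str] = None) -> bool:
    """Return True if minimum <= installed <= maximum (no upper bound when maximum is None)."""
    version = parse_version(installed)
    if version < parse_version(minimum):
        return False
    if maximum is not None and version > parse_version(maximum):
        return False
    return True


# Values taken from the error message and the installed package list.
print(is_compatible("0.36.5", "1.0"))  # False
```

Under these assumptions, the playbook can only pass once an openstacksdk release at or above 1.0 is installed.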
Packages Installed :
ansible-2.9.27-1.el8ae.noarch Fri Oct 13 06:56:05 2023
python3-netaddr-0.7.19-8.el8.noarch Fri Oct 13 06:55:44 2023
python3-openstackclient-4.0.2-2.20230404115110.54bf2c0.el8ost.noarch Tue Nov 21 01:38:32 2023
python3-openstacksdk-0.36.5-2.20220111021051.feda828.el8ost.noarch Fri Oct 13 06:55:52 2023
Document followed :
https://docs.openshift.com/container-platform/4.13/installing/installing_openstack/installing-openstack-user.html#installation-osp-downloading-modules_installing-openstack-user
As part of OCPBUGS-18641 we added code that appends the internal OVN-K8s subnet `fd69::2/128` to the `ExcludeNetworkSubnetCIDR` list for dual-stack installations.
We have now discovered that for IPv6-only clusters this subnet is missing from the list, even though it should be present.
This causes vSphere IPv6-only setups to work incorrectly.
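The intended behavior can be sketched as follows: the subnet is appended whenever the cluster uses IPv6 at all, whether dual-stack or IPv6-only. This is only an illustration. The function name, its inputs, and the example networks are hypothetical, and the real logic lives in the component's own code.

```python
import ipaddress
from typing import List

OVN_INTERNAL_V6 = "fd69::2/128"  # internal OVN-K8s subnet named in the description


def exclude_cidrs(machine_networks: List[str], existing: List[str]) -> List[str]:
    """Return the exclude list, adding the OVN-K8s subnet if any network is IPv6."""
    has_ipv6 = any(ipaddress.ip_network(cidr).version == 6 for cidr in machine_networks)
    result = list(existing)
    if has_ipv6 and OVN_INTERNAL_V6 not in result:
        result.append(OVN_INTERNAL_V6)
    return result


# Hypothetical machine networks for the three cluster types.
print(exclude_cidrs(["192.168.0.0/24", "fd00::/64"], []))  # dual-stack
print(exclude_cidrs(["fd00::/64"], []))                    # IPv6-only
print(exclude_cidrs(["192.168.0.0/24"], []))               # IPv4-only
```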
Description of problem:
We would like to include the CEL IP and CIDR validations in 4.16. They have been merged upstream and can be backported into OpenShift to improve our validation downstream. Upstream PR: https://github.com/kubernetes/kubernetes/pull/121912
When setting up router sharding with `endpointPublishingStrategy: Private` in an OCP 4.13.11 BareMetal cluster, the restricted-readonly scc is applied to the router pods, causing them to enter CrashLoopBackOff:
~~~
$ oc get pod -n openshift-ingress router-spinque-xxx -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<
$ oc get pod -n openshift-ingress router-spinque-xxxj -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<<
$ oc get pod -n openshift-ingress router-spinque-xxx -oyaml | grep -i scc
openshift.io/scc: restricted-readonly <<<<
~~~
~~~
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
router-spinque-xxx 0/1 CrashLoopBackOff 27 2h
~~~
Please find the must-gather as well as the sos-report from one of the nodes in the case 03624389 in supportshell
—
The following scc config can be used to reproduce this issue on any platform:
allowPrivilegeEscalation: true
allowedCapabilities: []
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: MustRunAs
groups:
- system:authenticated
kind: SecurityContextConstraints
metadata:
  name: bad-router
priority: 0
readOnlyRootFilesystem: true
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: MustRunAsRange
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: RunAsAny
users: []
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
Save the above yaml as bad-router-scc.yaml then apply it to your cluster:
$ oc apply -f bad-router-scc.yaml
Force the restart of router pods, such as by deleting one:
$ oc delete pod router-default-6465854689-gvjhs
The newly started pod(s) should be running but not ready, with the bad-router scc:
$ oc get pods
NAME READY STATUS RESTARTS AGE
router-default-6465854689-7x558 0/1 Running 0 49s
$ oc get pod router-default-6465854689-7x558 -o yaml|grep scc
openshift.io/scc: bad-router
If you wait long enough, it will restart multiple times, and eventually enter the CrashLoopBackOff state
When the HTTP listen call fails, the webhook does not handle the error in a way that allows recovery; instead it hangs without printing anything useful in the logs.
This was seen after the change in https://issues.redhat.com//browse/OCPBUGS-20104, where the webhook was reconfigured to run as non-root. On upgrade, listen would fail because the old webhook instance was still running as root, which causes an error due to the SOREUSE socket option.
The webhook should crashloop instead, which would provide a chance of recovery, although the recovery itself might still be racy, depending on whether k8s is able to kill the old webhook instance before noticing the crash of the new instance.
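The requested behavior, failing fast on a listen error rather than hanging, can be sketched as below. This is an illustration only, using Python's standard `socket` module; the helper name `listen_or_exit` is hypothetical and does not reflect the webhook's actual code.

```python
import socket
import sys


def listen_or_exit(host: str, port: int) -> socket.socket:
    """Bind and listen on host:port, or log the error and exit non-zero."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError as err:
        sock.close()
        # Log the cause, then exit so the supervisor (for example the kubelet)
        # restarts the process instead of leaving it hung.
        print(f"listen on {host}:{port} failed: {err}", file=sys.stderr)
        raise SystemExit(1)
    return sock
```

Exiting with a non-zero status is what lets the container restart policy retry the listen once the old instance has gone away.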
Description of problem:
documentationBaseURL still points to 4.14
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Check documentationBaseURL on a 4.15 cluster:
# oc get configmap console-config -n openshift-console -o yaml | grep documentationBaseURL
documentationBaseURL: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.14/
Actual results:
1. documentationBaseURL is still pointing to 4.14
Expected results:
1. documentationBaseURL should point to 4.15
Additional info:
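A check for this kind of drift can be sketched by deriving the expected URL from the cluster version and comparing it with the configured value. This is a minimal sketch: the function names are hypothetical, and the URL layout is assumed from the value shown in the reproduction step.

```python
DOCS_BASE = "https://access.redhat.com/documentation/en-us/openshift_container_platform"


def expected_docs_url(cluster_version: str) -> str:
    """Build the docs URL for the major.minor part of a version such as '4.15.0'."""
    major, minor = cluster_version.split(".")[:2]
    return f"{DOCS_BASE}/{major}.{minor}/"


def docs_url_matches(cluster_version: str, configured_url: str) -> bool:
    """Return True if the configured documentationBaseURL matches the cluster version."""
    return configured_url == expected_docs_url(cluster_version)


# The value observed on the 4.15 cluster in this report.
configured = f"{DOCS_BASE}/4.14/"
print(docs_url_matches("4.15.0", configured))  # False
```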
This is a tracker bug for issues discovered when working on https://issues.redhat.com/browse/METAL-940. No QA verification will be possible until the feature is implemented much later.
Issue 19 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Horizontal alignment is slightly off between text and icon
Screenshot: https://drive.google.com/file/d/1nzFHCeorlVIMbwlnjzEc1fCW0GXQa1KT/view
openshift-install makes many calls to OpenStack APIs when installing OpenShift on OpenStack. Currently all of these calls use the same default User-Agent header gophercloud/x.y.z, where x.y.z is the version of the gophercloud that openshift-install was built with.
Keystone logs the User-Agent string, as do other OpenStack services, and it can provide important information about who is interacting with the cloud. As recently seen in OCPBUGS-14049, it can also be useful when debugging issues with components.
We should configure the User-Agent header for openshift-install and all other OpenShift components that talk to OpenStack APIs.
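One way to picture the result is a component-specific token placed in front of the library default, so that service logs show which OpenShift component made the call. The sketch below only illustrates the string composition. The helper name and the version value are hypothetical, and `gophercloud/x.y.z` is kept as the placeholder used above.

```python
def build_user_agent(component: str, component_version: str, default_agent: str) -> str:
    """Prefix the library's default User-Agent with a component/version token."""
    return f"{component}/{component_version} {default_agent}"


# Hypothetical component version; the default agent is the placeholder from the description.
print(build_user_agent("openshift-install", "4.15.0", "gophercloud/x.y.z"))
# openshift-install/4.15.0 gophercloud/x.y.z
```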
Please review the following PR: https://github.com/openshift/origin/pull/28264
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates that the image(s) being used downstream for production builds are not consistent with the images referenced in this component's GitHub repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to be reopened automatically.
This is a clone of issue OCPBUGS-26594. The following is the description of the original issue:
—
Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-monitoring.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-04T00:00:00Z
End Time: 2024-01-10T23:59:59Z
Success Rate: 42.31%
Successes: 11
Failures: 15
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 151
Failures: 0
Flakes: 0
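The success rates above follow directly from the listed counts. A small sketch of the arithmetic, assuming flakes are counted as passes (both flake counts are 0 here, so that assumption does not affect these numbers):

```python
def success_rate(successes: int, failures: int, flakes: int = 0) -> float:
    """Return the pass percentage; flakes are treated as passes (an assumption)."""
    total = successes + failures + flakes
    if total == 0:
        raise ValueError("no runs to evaluate")
    return 100.0 * (successes + flakes) / total


sample = success_rate(successes=11, failures=15)   # 4.15 sample window
base = success_rate(successes=151, failures=0)     # 4.14 base window
print(f"sample={sample:.2f}% base={base:.2f}% drop={base - sample:.2f} points")
```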
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/60
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This bug is to track the work needed to merge the CNO IPsec API backports
Description of problem:
According to the Azure documentation [1], NCv2 series Azure virtual machines (VMs) were retired on September 6, 2023, and VMs can no longer be provisioned on those instance types. Remove standardNCSv2Family from the Azure doc tested_instance_types_x86_64 on 4.13+. [1] https://learn.microsoft.com/en-us/azure/virtual-machines/ncv2-series
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Cluster installation fails on an NCv2 series instance type.
Actual results:
Expected results:
Additional info:
When looking at an ACM must-gather for a managed cluster, no information for the ConfigurationPolicies can be seen. It appears that this command in the must-gather script has an error:
oc adm inspect configurationpolicies.policy.open-cluster-management.io --all-namespaces --dest-dir=must-gather
The error (which is not logged in the must-gather itself...) looks like:
error: errors ocurred while gathering data:
skipping gathering due to error: the server doesn't have a resource type ""
ConfigurationPolicy YAML should be collected in the must-gather to help in debugging.
The Samples operator in OKD refers to docker.io/openshift/wildfly images, which are no longer available. Library sync should update the samples to use quay.io links.
Description of problem:
The name for ImageDigestMirrorSet created by oc-mirror is not valid
Version-Release number of selected component (if applicable):
4.15
How reproducible:
always
Steps to Reproduce:
1. Create the resource from the IDMS YAML file generated by oc-mirror; the creation fails with an error.
Actual results:
cat out/working-dir/cluster-resources/idms_2023-11-16T04\:04\:49Z.yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms_2023-11-16T04:04:49Z
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-143-247-94.us-east-2.compute.amazonaws.com:5000/ocp/openshift-release-dev
    source: quay.io/openshift-release-dev
  - mirrors:
    - ec2-3-143-247-94.us-east-2.compute.amazonaws.com:5000/ocp/openshift
    source: localhost:5005/openshift
status: {}
oc create -f out/working-dir/cluster-resources/idms_2023-11-16T04\:04\:49Z.yaml
The ImageDigestMirrorSet "idms_2023-11-16T04:04:49Z" is invalid: metadata.name: Invalid value: "idms_2023-11-16T04:04:49Z": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')
Expected results:
The generated name is valid and the resource is created without error.
Additional info:
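The rejection can be reproduced offline with the RFC 1123 subdomain pattern used by Kubernetes name validation. The sketch below is in Python for illustration; `sanitize_name` is a hypothetical helper showing one possible way to derive a valid name from the timestamped one, not the fix that oc-mirror adopted.

```python
import re

# RFC 1123 subdomain pattern; fullmatch() applies it to the whole string.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)
MAX_NAME_LENGTH = 253


def is_valid_name(name: str) -> bool:
    """Return True if name is a lowercase RFC 1123 subdomain."""
    return len(name) <= MAX_NAME_LENGTH and RFC1123_SUBDOMAIN.fullmatch(name) is not None


def sanitize_name(name: str) -> str:
    """Hypothetical helper: lowercase, then replace disallowed characters with '-'."""
    cleaned = re.sub(r"[^a-z0-9.-]", "-", name.lower())
    return cleaned.strip("-.")


original = "idms_2023-11-16T04:04:49Z"
print(is_valid_name(original), sanitize_name(original))
```

The underscore, the uppercase `T` and `Z`, and the colons in the timestamp are what make the original name invalid. The helper is not a general sanitizer, so its result should still be checked with `is_valid_name`.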
This is a clone of issue OCPBUGS-18115. The following is the description of the original issue:
—
Description of problem:
After enabling user-defined monitoring on a HyperShift hosted cluster, PrometheusOperatorRejectedResources starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot.
2. Enable user-defined monitoring.
Actual results:
The PrometheusOperatorRejectedResources alert starts firing.
Expected results:
No alert firing
Additional info:
Need to reach out to the HyperShift folks as the fix should probably be in their code base.
Description of problem:
Install a cluster with Azure marketplace image 413.92.2023101700, and set the field osImage:plan to NoPurchasePlan.
install-config.yaml:
--------------------
platform:
  azure:
    baseDomainResourceGroupName: os4-common
    cloudName: AzurePublicCloud
    outboundType: Loadbalancer
    region: southcentralus
    defaultMachinePlatform:
      osImage:
        offer: rh-ocp-worker
        publisher: redhat
        sku: rh-ocp-worker-gen1
        version: 413.92.2023101700
        plan: NoPurchasePlan
The bootstrap VM fails to provision with the following Terraform error:
DEBUG In addition to the other similar warnings shown, 3 other variable(s) defined
DEBUG without being declared.
ERROR
ERROR Error: waiting for creation of Linux Virtual Machine: (Name "jima02test-7jf8d-bootstrap" / Resource Group "jima02test-7jf8d-rg"): Code="VMMarketplaceInvalidInput" Message="Creating a virtual machine from Marketplace image or a custom image sourced from a Marketplace image requires Plan information in the request. VM: '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima02test-7jf8d-rg/providers/Microsoft.Compute/virtualMachines/jima02test-7jf8d-bootstrap'."
ERROR
ERROR   with azurerm_linux_virtual_machine.bootstrap,
ERROR   on main.tf line 194, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR  194: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failure applying terraform for "bootstrap" stage: failed to create cluster: failed to apply Terraform: exit status 1
ERROR
ERROR Error: waiting for creation of Linux Virtual Machine: (Name "jima02test-7jf8d-bootstrap" / Resource Group "jima02test-7jf8d-rg"): Code="VMMarketplaceInvalidInput" Message="Creating a virtual machine from Marketplace image or a custom image sourced from a Marketplace image requires Plan information in the request. VM: '/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima02test-7jf8d-rg/providers/Microsoft.Compute/virtualMachines/jima02test-7jf8d-bootstrap'."
ERROR
ERROR   with azurerm_linux_virtual_machine.bootstrap,
ERROR   on main.tf line 194, in resource "azurerm_linux_virtual_machine" "bootstrap":
ERROR  194: resource "azurerm_linux_virtual_machine" "bootstrap" {
ERROR
ERROR
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-11-01-235040
How reproducible:
Always
Steps to Reproduce:
1. Set an Azure marketplace image (one that has a purchase plan) and plan: NoPurchasePlan in the install-config.yaml file.
2. Trigger the installation.
Actual results:
The bootstrap VM fails to provision.
Expected results:
The installer should validate the plan when a marketplace image with a purchase plan is used, and exit early with a proper message.
Additional info:
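The requested early validation could look roughly like the following Python sketch. The function name, the `image_requires_plan` input (which a real check would have to obtain from the Azure marketplace image metadata), the accepted plan values, and the message wording are all assumptions made for illustration; this is not the installer's code.

```python
from typing import List

# Assumed set of accepted values for the osImage plan field.
KNOWN_PLANS = ("WithPurchasePlan", "NoPurchasePlan")


def validate_os_image_plan(plan: str, image_requires_plan: bool) -> List[str]:
    """Return validation errors for the osImage plan setting (empty if valid)."""
    errors: List[str] = []
    if plan not in KNOWN_PLANS:
        errors.append(f"unsupported plan {plan!r}; expected one of {KNOWN_PLANS}")
    elif plan == "NoPurchasePlan" and image_requires_plan:
        errors.append(
            "the marketplace image has a purchase plan, so plan must not be NoPurchasePlan"
        )
    return errors


# The configuration from this report: a marketplace image with a purchase plan.
for problem in validate_os_image_plan("NoPurchasePlan", image_requires_plan=True):
    print("invalid install-config:", problem)
```

Failing at this point would surface the problem before any Terraform stage runs, instead of after the bootstrap VM creation is rejected by Azure.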
Description of problem:
vSphere IPI installation fails with panic: runtime error: invalid memory address or nil pointer dereference
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Download the 4.13 installation binary.
2. Run the openshift-install create cluster command.
Actual results:
Error:
DEBUG Generating Platform Provisioning Check...
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x50 pc=0x3401c4e]
goroutine 1 [running]:
github.com/openshift/installer/pkg/asset/installconfig/vsphere.validateESXiVersion(0xc001524060?, {0xc00018aff0, 0x43}, 0x1?, 0x1?)
    /go/src/github.com/openshift/installer/pkg/asset/installconfig/vsphere/validation.go:279 +0xb6e
github.com/openshift/installer/pkg/asset/installconfig/vsphere.validateFailureDomain(0xc001524060, 0xc00022c840, 0x0)
    /go/src/github.com/openshift/installer/pkg/asset/installconfig/vsphere/validation.go:167 +0x6b6
github.com/openshift/installer/pkg/asset/installconfig/vsphere.ValidateForProvisioning(0xc0003d4780)
    /go/src/github.com/openshift/installer/pkg/asset/installconfig/vsphere/validation.go:132 +0x675
github.com/openshift/installer/pkg/asset/installconfig.(*PlatformProvisionCheck).Generate(0xc0000f2000?, 0x5?)
    /go/src/github.com/openshift/installer/pkg/asset/installconfig/platformprovisioncheck.go:112 +0x45f
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000925e90, {0x1dc012d0, 0x2279afa8}, {0x7c34091, 0x2})
    /go/src/github.com/openshift/installer/pkg/asset/store/store.go:226 +0x5fa
github.com/openshift/installer/pkg/asset/store.(*storeImpl).fetch(0xc000925e90, {0x1dc01090, 0x22749ce0}, {0x0, 0x0})
    /go/src/github.com/openshift/installer/pkg/asset/store/store.go:220 +0x75b
github.com/openshift/installer/pkg/asset/store.(*storeImpl).Fetch(0x7ffe670305f1?, {0x1dc01090, 0x22749ce0}, {0x227267a0, 0x8, 0x8})
    /go/src/github.com/openshift/installer/pkg/asset/store/store.go:76 +0x48
main.runTargetCmd.func1({0x7ffe670305f1, 0x6})
    /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:260 +0x125
main.runTargetCmd.func2(0x2272da00?, {0xc000925410?, 0x3?, 0x3?})
    /go/src/github.com/openshift/installer/cmd/openshift-install/create.go:290 +0xe7
github.com/spf13/cobra.(*Command).execute(0x2272da00, {0xc000925380, 0x3, 0x3})
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000210900)
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:1040 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
    /go/src/github.com/openshift/installer/vendor/github.com/spf13/cobra/command.go:968
main.installerMain()
    /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:61 +0x2b0
main.main()
    /go/src/github.com/openshift/installer/cmd/openshift-install/main.go:38 +0xff
Expected results:
Installation to be completed successfully.
Additional info:
All metal jobs failed a bunch of tests with errors about looking up thanos DNS record.
{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:106]: Failed to fetch alerting rules: unable to query https://thanos-querier-openshift-monitoring.apps.ostest.test.metalkube.org/api/v1/rules: Get "https://thanos-querier-openshift-monitoring.apps.ostest.test.metalkube.org/api/v1/rules": dial tcp: lookup thanos-querier-openshift-monitoring.apps.ostest.test.metalkube.org on 172.30.0.10:53: no such host: %!w(<nil>) Ginkgo exit error 1: exit with code 1}
[sig-instrumentation][Late] OpenShift alerting rules [apigroup:image.openshift.io] should link to an HTTP(S) location if the runbook_url annotation is defined [Suite:openshift/conformance/parallel]
Description of problem:
With the installer's AWS SDK install path enabled, creating a C2S cluster hits the following fatal error: level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create bootstrap resources: failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role): RequestCanceled: request context canceled
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-03-140457 4.16.0-0.nightly-2024-01-03-193825
How reproducible:
Always
Steps to Reproduce:
1. Enable AWS SDK install and create a C2S cluster 2. 3.
Actual results:
failed to create bootstrap instance profile: failed to create role (yunjiang-14c2a-t4wp7-bootstrap-role), bootstrap process failed
Expected results:
The bootstrap process finishes successfully.
Additional info:
The issue does not occur with the Terraform-based install path.
I would like for the Azure storage account to be destroyed as part of the bootstrap destroy process, so that the storage account is not persisted for the life of the cluster which incurs costs and other management effort.
Description of criteria:
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/31
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Check scripts for the on-premise keepalived static pods only check haproxy, which only directs to the kube-apiserver pod. They do not take into consideration whether the control plane node has a healthy machine-config-server.
This may be a problem because, in a failure scenario, it may be required to rebuild nodes, and machine-config-server is required for that (so that ignitions are provided). One example is the etcd restore procedure (https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html).
In our case, the following happened (I'd suggest reading the recovery procedure before this sequence of events):
- Machine config server was healthy in the recovery control plane node but not in the other hosts.
- At this point, we can only guarantee the health of the recovery control plane node because the non-recovery ones are to be replaced and must be removed first from the cluster (node objects deleted) so that the OVN-Kubernetes control plane can work properly.
- The keepalived check scripts were succeeding in the non-recovery control plane nodes because their haproxy pods were up and running. That is fine from the kube-apiserver point of view, actually, but does not take machine config server into consideration.
- As the machine-config-server was not reachable, provisioning the new masters required by the procedure was impossible.
In parallel to this bug, I'll be raising another bug to improve the restore procedure, basically asking to stop the keepalived static pods on the non-recovery control plane nodes. This would prevent the exact situation above. However, there are other situations where machine-config-server pods may be unhealthy and we should not just be manually stopping keepalived. In such cases, keepalived should take machine-config-server into consideration.
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Under some failure scenarios, where machine-config-server is not healthy in one control plane node.
Steps to Reproduce:
1. Try to provision new machine for recovery. 2. 3.
Actual results:
Machine-config-server not serving because keepalived assigned the VIP to one node that doesn't have a working machine-config-server pod.
Expected results:
Keepalived to take machine-config-server health into consideration while doing failover.
Additional info:
Possible ideas to fix:
- Create a check script for the machine-config-server check. It may have less weight than the kube-apiserver ones.
- Include the machine-config-server endpoint in the haproxy of the kube-apiservers.
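The first idea can be sketched as a small probe whose exit code keepalived would consume (a check script signals health with exit code 0 and failure with a non-zero code). This is a Python illustration only: the health URL, port, and path are assumptions about how machine-config-server could be probed locally, and the function names are hypothetical.

```python
import ssl
import urllib.error
import urllib.request
from typing import Callable

# Assumption: machine-config-server answers health probes at this local URL.
MCS_HEALTH_URL = "https://localhost:22623/healthz"


def http_status(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # local liveness probe only; certificate not verified
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError):
        return 0


def check_exit_code(probe: Callable[[str], int] = http_status) -> int:
    """Map the probe result to a check-script exit code: 0 healthy, 1 unhealthy."""
    return 0 if probe(MCS_HEALTH_URL) == 200 else 1
```

A script built on this would end with `sys.exit(check_exit_code())`. Giving it a lower weight than the kube-apiserver check, as suggested above, would be done in the keepalived configuration rather than in the script itself.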
Description of problem:
On an SNO a new CA certificate is not loaded after updating user-ca-bundle configmap and as a result the cluster cannot pull images from a registry with a certificate signed by the new CA.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Update ca-bundle.crt (replace it with a new certificate if applicable) in the `user-ca-bundle` configmap in the openshift-config namespace.
   * On the node, ensure that /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt was updated with the new certificate.
2. Create a pod which uses an image from a registry that has its certificate signed by the new CA cert provided in ca-bundle.crt.
Actual results:
The pod fails to pull the image:
Failed to pull image "registry.ztp-hub-00.mobius.lab.eng.rdu2.redhat.com:5000/centos/centos:8": rpc error: code = Unknown desc = pinging container registry registry.ztp-hub-00.mobius.lab.eng.rdu2.redhat.com:5000: Get "https://registry.ztp-hub-00.mobius.lab.eng.rdu2.redhat.com:5000/v2/": tls: failed to verify certificate: x509: certificate signed by unknown authority
* On the node, trying to reach the registry via curl fails certificate validation:
curl https://registry.ztp-hub-00.mobius.lab.eng.rdu2.redhat.com:5000
curl: (60) SSL certificate problem: self-signed certificate
More details here: https://curl.se/docs/sslcerts.html
To be able to create a pod I had to:
* Run `sudo update-ca-trust`. After that, curl https://registry.ztp-hub-00.mobius.lab.eng.rdu2.redhat.com:5000 worked without issues, but pod creation still failed with the "tls: failed to verify certificate: x509: certificate signed by unknown authority" error.
* Run `sudo systemctl restart crio`. After that, pod creation succeeded and the image could be pulled.
Expected results:
Additional info:
Attaching must gather
In a 4.16.0-ec.1 cluster, scaling up a MachineSet with publicIP:true fails with:
$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort | uniq -c
1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden
Seen in 4.16.0-ec.1. Not noticed in 4.15.0-ec.3. Fix likely needs a backport to 4.15 to catch up with OCPBUGS-26406.
Seen in the wild in a cluster after updating from 4.15.0-ec.3 to 4.16.0-ec.1. Reproduced in Cluster Bot on the first attempt, so likely very reproducible.
launch 4.16.0-ec.1 gcp Cluster Bot cluster (logs).
$ oc adm upgrade
Cluster version is 4.16.0-ec.1
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.16 (available channels: candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-q4d8y8t-72292-msmgw-worker-a   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-b   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-c   1         1         1       1           60m
ci-ln-q4d8y8t-72292-msmgw-worker-f   0         0                             60m
$ oc -n openshift-machine-api get -o json machinesets | jq -c '.items[].spec.template.spec.providerSpec.value.networkInterfaces' | sort | uniq -c
4 [{"network":"ci-ln-q4d8y8t-72292-msmgw-network","subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api edit machineset ci-ln-q4d8y8t-72292-msmgw-worker-f  # add publicIP
$ oc -n openshift-machine-api get -o json machineset ci-ln-q4d8y8t-72292-msmgw-worker-f | jq -c '.spec.template.spec.providerSpec.value.networkInterfaces'
[{"network":"ci-ln-q4d8y8t-72292-msmgw-network","publicIP":true,"subnetwork":"ci-ln-q4d8y8t-72292-msmgw-worker-subnet"}]
$ oc -n openshift-machine-api scale --replicas 1 machineset ci-ln-q4d8y8t-72292-msmgw-worker-f
$ sleep 300
$ oc -n openshift-machine-api get -o json machines.machine.openshift.io | jq -r '.items[] | select(.status.phase == "Failed") | .status.providerStatus.conditions[].message' | sort | uniq -c
1 googleapi: Error 403: Required 'compute.subnetworks.useExternalIp' permission for 'projects/openshift-gce-devel-ci-2/regions/us-central1/subnetworks/ci-ln-q4d8y8t-72292-msmgw-worker-subnet', forbidden
Successfully created machines.
I would expect the CredentialsRequest to ask for this permission, but it doesn't seem to. The old roles/compute.admin includes it, and it probably just needs to be added explicitly. Not clear how many other permissions might also need explicit listing.
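One way to find out which other permissions need explicit listing is to diff the permissions a feature requires against those a CredentialsRequest grants. The Python sketch below only illustrates that comparison: apart from `compute.subnetworks.useExternalIp`, which comes from the error above, the permission lists are hypothetical examples and not the contents of the real CredentialsRequest.

```python
from typing import Iterable, List


def missing_permissions(required: Iterable[str], granted: Iterable[str]) -> List[str]:
    """Return the required permissions that are not explicitly granted, sorted."""
    return sorted(set(required) - set(granted))


# Hypothetical lists for illustration only.
required_for_public_ip = [
    "compute.instances.create",
    "compute.subnetworks.use",
    "compute.subnetworks.useExternalIp",
]
granted_by_credentials_request = [
    "compute.instances.create",
    "compute.subnetworks.use",
]

print(missing_permissions(required_for_public_ip, granted_by_credentials_request))
```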
Description of problem:
To bump some dependencies for CVE fixes, we added `replace` directives in the go.mod file. These dependencies have since moved way past the pinned version. We should drop the replaces before we run into problems from having deps pinned to versions that are too old. For example, I've seen PRs with the following diff: # golang.org/x/net v0.23.0 => golang.org/x/net v0.5.0 which is not really what we want.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Some dependencies are not upgraded because they are pinned.
Expected results:
Additional info:
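Stale pins of this kind can be detected mechanically by comparing each `replace` target with the version that the `require` block asks for. The Python sketch below is a simplified illustration: it handles only plain `vMAJOR.MINOR.PATCH` versions, ignores pre-release suffixes and local-path replacements, and its function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Parse 'v1.2.3' (any suffix ignored) into (1, 2, 3); None if not a version."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+).*", text)
    return tuple(int(part) for part in match.groups()) if match else None


def stale_replaces(go_mod: str) -> List[Tuple[str, str, str]]:
    """Return (module, required, pinned) for replaces pinned below the required version."""
    requires: Dict[str, str] = {}
    replaces: List[Tuple[str, str]] = []
    for raw in go_mod.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        for keyword in ("require", "replace"):
            if line.startswith(keyword + " "):
                line = line[len(keyword):].strip()
        if "=>" in line:
            left, right = (side.split() for side in line.split("=>", 1))
            if left and len(right) == 2:
                replaces.append((left[0], right[1]))
        else:
            parts = line.split()
            if len(parts) >= 2 and parse_version(parts[1]):
                requires[parts[0]] = parts[1]
    stale = []
    for module, pinned in replaces:
        wanted = requires.get(module)
        pinned_version = parse_version(pinned)
        if wanted and pinned_version and pinned_version < parse_version(wanted):
            stale.append((module, wanted, pinned))
    return stale


EXAMPLE_GO_MOD = """
module example.com/demo

go 1.21

require (
    golang.org/x/net v0.23.0
    golang.org/x/sys v0.18.0 // indirect
)

replace golang.org/x/net => golang.org/x/net v0.5.0
"""
print(stale_replaces(EXAMPLE_GO_MOD))
```

The example go.mod reuses the `golang.org/x/net` versions quoted in the description; the module path `example.com/demo` and the `golang.org/x/sys` entry are made up for the illustration.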
Description of problem:
Currently, the MCO updates its image registry certificate configmap by deleting and re-creating it on each MCO sync. Instead, it should patch the configmap.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
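The difference between the two approaches can be illustrated by computing a JSON merge patch (RFC 7386) for the configmap's `data` and sending nothing when the content is unchanged. This is a Python sketch with hypothetical names and example keys, not the MCO's Go implementation.

```python
from typing import Dict, Optional


def configmap_data_patch(current: Dict[str, str], desired: Dict[str, str]) -> Optional[dict]:
    """Build a JSON merge patch for a ConfigMap's data, or None if nothing changed."""
    changes: Dict[str, Optional[str]] = {}
    for key, value in desired.items():
        if current.get(key) != value:
            changes[key] = value      # added or updated key
    for key in current:
        if key not in desired:
            changes[key] = None       # null removes the key in a merge patch
    return {"data": changes} if changes else None


# Hypothetical registry entries for illustration only.
current = {"registry-a.example.com": "CERT-A", "registry-b.example.com": "CERT-B"}
desired = {"registry-a.example.com": "CERT-A", "registry-b.example.com": "CERT-B2"}
print(configmap_data_patch(current, desired))
```

Unlike delete-and-recreate, a patch keeps the object's identity, and a sync that finds no difference performs no write at all.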
Description of problem:
This is to track the SDN-specific issue in https://issues.redhat.com/browse/OCPBUGS-18389. The 4.14 nightly has a higher pod ready latency than 4.14 ec4 and 4.13.z in the node-density (lite) test.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-11-201102
How reproducible:
Every time
Steps to Reproduce:
1. Install an SDN cluster and scale up to 24 worker nodes, install 3 infra nodes, and move the monitoring, ingress, and registry components to the infra nodes.
2. Run the node-density (lite) test with 245 pods per node.
3. Compare the pod ready latency to 4.13.z and 4.14 ec4.
Actual results:
4.14 nightly has a higher pod ready latency compared to 4.14 ec4 and 4.13.10
Expected results:
4.14 should have similar pod ready latency compared to previous release
Additional info:
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
With the new multus image provided by Dan Williams in https://issues.redhat.com/browse/OCPBUGS-18389, the latency on the 24-node SDN cluster is similar to the latency without the fix.
% oc -n openshift-network-operator get deployment.apps/network-operator -o yaml | grep MULTUS_IMAGE -A 1
- name: MULTUS_IMAGE
  value: quay.io/dcbw/multus-cni:informer
% oc get pod -n openshift-multus -o yaml | grep image: | grep multus
image: quay.io/dcbw/multus-cni:informer
....
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer | 232389 | 314 | f2c290c1-73ea-4f10-a797-3ab9d45e94b3 | aws | amd64 | SDN | 24 | 245 | 61234 | 311776 | https://drive.google.com/file/d/1o7JXJAd_V3Fzw81pTaLXQn1ms44lX6v5/view?usp=drive_link |
4.14.0-ec.4 | 231559 | 292 | 087eb40c-6600-4db3-a9fd-3b959f4a434a | aws | amd64 | SDN | 24 | 245 | 2186 | 3256 | https://drive.google.com/file/d/1NInCiai7WWIIVT8uL-5KKeQl9CtQN_Ck/view?usp=drive_link |
4.14.0-0.nightly-2023-09-02-132842 | 231558 | 291 | 62404e34-672e-4168-b4cc-0bd575768aad | aws | amd64 | SDN | 24 | 245 | 58725 | 294279 | https://drive.google.com/file/d/1BbVeNrWzVdogFhYihNfv-99_q8oj6eCN/view?usp=drive_link |
Zenghui Shi and Peng Liu requested modifying the multus-daemon-config ConfigMap by removing the readinessindicatorfile flag.
Steps:
Now the readinessindicatorfile flag is removed and all multus pods have been restarted:
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
Test Result: p99 is better than without the fix (removing readinessindicatorfile) but is still worse than ec4; avg is still bad.
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag | 232389 | 316 | d7a754aa-4f52-49eb-80cf-907bee38a81b | aws | amd64 | SDN | 24 | 245 | 51775 | 105296 | https://drive.google.com/file/d/1h-3JeZXQRO-zsgWzen6aNDQfSDqoKAs2/view?usp=drive_link |
Zenghui Shi and Peng Liu requested setting logLevel to debug in addition to removing the readinessindicatorfile flag.
Edit the ConfigMap to change "logLevel": "verbose" to "debug" and restart all multus pods.
Now logLevel is debug and all multus pods have been restarted:
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep logLevel
"logLevel": "debug",
% oc get cm multus-daemon-config -n openshift-multus -o yaml | grep readinessindicatorfile -c
0
OCP Version | Flexy Id | Scale Ci Job | Grafana URL | Cloud | Arch Type | Network Type | Worker Count | PODS_PER_NODE | Avg Pod Ready (ms) | P99 Pod Ready (ms) | Must-gather |
4.14.0-0.nightly-2023-09-11-201102 quay.io/dcbw/multus-cni:informer and remove readinessindicatorfile flag and logLevel=debug | 232389 | 320 | 5d1d3e6a-bfa1-4a4b-bbfc-daedc5605f7d | aws | amd64 | SDN | 24 | 245 | 49586 | 105314 | https://drive.google.com/file/d/1p1PDbnqm0NlWND-komc9jbQ1PyQMeWcV/view?usp=drive_link |
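To put the tables in perspective, the slowdown of each run relative to the 4.14.0-ec.4 baseline can be computed directly from the reported averages and P99 values. The short run labels below are shorthand introduced for this sketch; the numbers are taken from the tables above.

```python
# (avg_ms, p99_ms) pod ready latency per run, copied from the tables above.
BASELINE = (2186, 3256)  # 4.14.0-ec.4
RUNS = {
    "nightly-2023-09-02": (58725, 294279),
    "nightly-2023-09-11 + informer image": (61234, 311776),
    "+ readinessindicatorfile removed": (51775, 105296),
    "+ logLevel=debug": (49586, 105314),
}


def slowdown(run, baseline=BASELINE):
    """Return (avg_ratio, p99_ratio) of a run relative to the baseline."""
    return run[0] / baseline[0], run[1] / baseline[1]


for label, values in RUNS.items():
    avg_ratio, p99_ratio = slowdown(values)
    print(f"{label}: avg x{avg_ratio:.1f}, p99 x{p99_ratio:.1f}")
```

Even the best run here remains more than twenty times slower than the baseline on average, which matches the "avg is still bad" observation.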
Description of problem:
On a 4.14.5 fast-channel cluster in ARO, after the upgrade, when the customer tried to add a new node, the Machine Config was not applied and the node never joined the pool. This happens for every node and can only be remediated by SRE, not the customer.
Version-Release number of selected component (if applicable):
4.14.5 -candidate
How reproducible:
Every time a node is added to the cluster at this version.
Steps to Reproduce:
1. Install an ARO cluster 2. Upgrade it to 4.14 along fast channel 3. Add a node
Actual results:
  message: >-
    could not Create/Update MachineConfig: Operation cannot be fulfilled on machineconfigs.machineconfiguration.openshift.io "99-worker-generated-kubelet": the object has been modified; please apply your changes to the latest version and try again
  status: 'False'
  type: Failure
- lastTransitionTime: '2023-11-29T17:44:37Z'
Expected results:
Node is created and configured correctly.
Additional info:
MissingStaticPodControllerDegraded: static pod lifecycle failure - static pod: "kube-apiserver" in namespace: "openshift-kube-apiserver" for revision: 15 on node: "aro-cluster-REDACTED-master-0" didn't show up, waited: 4m45s
The goal is to collect metrics about user page interaction to better understand how customers use the console, and in turn develop a better experience.
acm_console_page_count:sum represents a counter for page visits across the main product pages.
Labels
The cardinality of the metric is at most 7 (7 page labels listed above - PrometheusRule is implemented to sum the page visit counts across Pods).
External consumers of MachineSets(), such as hive, need to be able to customize the client that queries the OpenStack cloud for trunk support.
OSASINFRA-3420, eliminating what looked like tech debt, removed that enablement, which had been added via a revert of a previous similar removal.
Reinstate the customizability, and include a docstring explanation to hopefully prevent it being removed again.
CI is almost permanently failing on MTU migration in 4.14 (both SDN and OVN-Kubernetes):
The common issue appears to be that waiting for the MCO times out:
+ echo '[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...'
[2023-08-31T03:58:16+00:00] Waiting for final Machine Controller Config...
+ timeout 900s bash
migration field is not cleaned by MCO
migration field is not cleaned by MCO
migration field is not cleaned by MCO
...
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/291
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The TaskRun duration diagram on the "Metrics" tab of pipeline is set to only show 4 TaskRuns in the legend regardless of the number of TaskRuns on the diagram.
Expected results:
All TaskRuns should be displayed in the legend.
The openshift/router repository vendors k8s.io/* v0.27.2. OpenShift 4.15 is based on Kubernetes 1.28.
4.15.
Always.
Check https://github.com/openshift/router/blob/release-4.15/go.mod.
The k8s.io/* packages are at v0.27.2.
The k8s.io/* packages are at v0.28.0 or newer.
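The k8s.io/* staging modules are tagged v0.N.P for Kubernetes v1.N.P, so the expected result can be checked by comparing the vendored minor version with the Kubernetes minor the release is based on. A minimal Python sketch with hypothetical function names:

```python
import re


def staging_minor(version: str) -> int:
    """Extract N from a k8s.io/* module version of the form v0.N.P."""
    match = re.fullmatch(r"v0\.(\d+)\.(\d+).*", version)
    if match is None:
        raise ValueError(f"not a k8s.io staging version: {version!r}")
    return int(match.group(1))


def matches_kubernetes(version: str, kubernetes_minor: int) -> bool:
    """Return True if the vendored module is at least at the given Kubernetes minor."""
    return staging_minor(version) >= kubernetes_minor


# OpenShift 4.15 is based on Kubernetes 1.28.
print(matches_kubernetes("v0.27.2", 28), matches_kubernetes("v0.28.0", 28))
```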
Description of problem:
The automated test case OCP-57089 failed on a 4.14 Azure platform. A manual check showed that the created internal load-balancer service could not get an external IP address.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-09-164123
How reproducible:
100% on the cluster
Steps to Reproduce:
1. Add a wait in the auto script, then run the case:
g.By("check if the lb services have obtained the EXTERNAL-IPs")
regExp := "([0-9]+.[0-9]+.[0-9]+.[0-9]+)"
time.Sleep(3600 * time.Second)
% ./bin/extended-platform-tests run all --dry-run | grep 57089 | ./bin/extended-platform-tests run -f -
2. % oc get ns | grep e2e-test-router
e2e-test-router-ingressclass-n2z2c Active 2m51s
3. It was pending in the EXTERNAL-IP column for the internal-lb-57089 service:
% oc -n e2e-test-router-ingressclass-n2z2c get svc
NAME                TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)           AGE
external-lb-57089   LoadBalancer   172.30.198.7    20.42.34.61   28443:30193/TCP   3m6s
internal-lb-57089   LoadBalancer   172.30.214.30   <pending>     29443:31507/TCP   3m6s
service-secure      ClusterIP      172.30.47.70    <none>        27443/TCP         3m13s
service-unsecure    ClusterIP      172.30.175.59   <none>        27017/TCP         3m13s
4. % oc -n e2e-test-router-ingressclass-n2z2c get svc internal-lb-57089 -oyaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  creationTimestamp: "2023-09-12T07:56:42Z"
  finalizers:
  - service.kubernetes.io/load-balancer-cleanup
  name: internal-lb-57089
  namespace: e2e-test-router-ingressclass-n2z2c
  resourceVersion: "209376"
  uid: b163bc03-b1c6-4e7b-b4e1-c996e9d135f4
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.30.214.30
  clusterIPs:
  - 172.30.214.30
  externalTrafficPolicy: Cluster
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    nodePort: 31507
    port: 29443
    protocol: TCP
    targetPort: 8443
  selector:
    name: web-server-rc
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer: {}
Actual results:
internal-lb-57089 service couldn't get an external-IP address
Expected results:
internal-lb-57089 service can get an external-IP address
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-nutanix/pull/23
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Using packages from k8s.io/kubernetes is not supported: https://github.com/kubernetes/kubernetes/issues/79384#issuecomment-505627280
This came about in this slack thread: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1694210392218409?thread_ts=1694207119.447459&cid=C02CZNQHGN8
Description of problem:
Got an undiagnosed panic:
Undiagnosed panic detected in pod { pods/openshift-ovn-kubernetes_ovnkube-node-mtws2_ovnkube-controller_previous.log.gz:E0929 20:36:20.743430 5682 runtime.go:79] Observed a panic: &runtime.TypeAssertionError{_interface:(*runtime._type)(0x1f9aaa0), concrete:(*runtime._type)(0x20da3e0), asserted:(*runtime._type)(0x22d0600), missingMethod:""} (interface conversion: interface {} is cache.DeletedFinalStateUnknown, not *v1.NetworkAttachmentDefinition)}
in this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade/1707819503263420416
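The panic above is a failed type assertion: a handler received a cache.DeletedFinalStateUnknown tombstone where it expected a *v1.NetworkAttachmentDefinition. A common defensive pattern is to unwrap the tombstone before asserting the type. The following is a minimal, language-neutral sketch of that pattern in Python; both classes and the unwrap_deleted helper are hypothetical stand-ins for illustration, not the ovn-kubernetes code or its actual fix.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class NetworkAttachmentDefinition:
    """Hypothetical stand-in for *v1.NetworkAttachmentDefinition."""
    namespace: str
    name: str


@dataclass
class DeletedFinalStateUnknown:
    """Stand-in for the client-go tombstone, which carries a key and the last known object."""
    key: str
    obj: Any


def unwrap_deleted(obj: Any) -> Optional[NetworkAttachmentDefinition]:
    """Return the NetworkAttachmentDefinition behind a delete event, or None.

    Handles both the plain object and the tombstone wrapper instead of
    assuming the event always carries the plain object.
    """
    if isinstance(obj, NetworkAttachmentDefinition):
        return obj
    if isinstance(obj, DeletedFinalStateUnknown):
        inner = obj.obj
        if isinstance(inner, NetworkAttachmentDefinition):
            return inner
    # Unknown type: report "nothing to handle" rather than raising.
    return None


if __name__ == "__main__":
    nad = NetworkAttachmentDefinition("default", "net-a")
    tombstone = DeletedFinalStateUnknown("default/net-a", nad)
    print(unwrap_deleted(tombstone))
```

With this shape, a delete handler can log and skip an unexpected object instead of panicking.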
Version-Release number of selected component (if applicable):
4.15 ci payload: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.ci/release/4.15.0-0.ci-2023-09-29-180633 https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/aggregated-gcp-ovn-upgrade-4.15-micro-release-openshift-release-analysis-aggregator/1707819513325555712
How reproducible:
This is the first time I have noticed it on 4.15.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/563
This is a clone of issue OCPBUGS-25897. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Create a CredentialsRequest that includes the spec.providerSpec.stsIAMRoleARN string.
2. The Cloud Credential Operator cannot populate a Secret based on the CredentialsRequest.
   $ oc get secret -A | grep test-mihuang
   # Secret not found.
   $ oc get CredentialsRequest -n openshift-cloud-credential-operator
   NAME           AGE
   ...
   test-mihuang   44s
Actual results:
The Secret is not created.
Expected results:
Successfully created the secret on the hosted cluster.
Additional info:
Description of problem:
Installer now errors when attempting to use networkType: OpenShiftSDN; but the message still says "deprecated".
Version-Release number of selected component (if applicable):
4.15+
How reproducible:
100%
Steps to Reproduce:
1. Attempt to install 4.15+ with networkType: OpenShiftSDN
   Observe error in logs:
   time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Actual results:
Observe error in logs:
time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is deprecated, please use OVNKubernetes"
Expected results:
A message more like:
time="2024-03-01T14:37:25Z" level=error msg="failed to fetch Master Machines: failed to load asset \"Install Config\": failed to create install config: invalid \"install-config.yaml\" file: networking.networkType: Invalid value: \"OpenShiftSDN\": networkType OpenShiftSDN is not supported, please use OVNKubernetes"
Additional info:
See thread
Description
Windows host process containers are in alpha, as of Kubernetes 1.22. With this new feature, it should be possible to add `oc debug` functionality for Windows nodes. This would help us as developers, and has the potential to be useful for debugging customer issues as well.
Description of problem:
I have a customer trying to deploy 4.14.1 IPI on vsphere and running into:
time="2023-11-14T14:30:35+01:00" level=fatal msg="failed to fetch Terraform Variables: failed to generate asset \"Terraform Variables\": network '/Datacenter_name/VLAN2506' not found
A similar configuration works fine with OCP 4.13
The network profile VLAN2506 is available in the network list of the installer survey.
The network is available at '/datacenter/network/VLAN2506' when checked with the govc command.
We found https://bugzilla.redhat.com/show_bug.cgi?id=2063829, but that bug was reported for a network nested under a folder, whereas here the network is directly inside the datacenter.
We tried this with the 4.14 installer in our lab environment but did not hit the issue.
Version-Release number of selected component (if applicable):
4.14.1
This is a clone of issue OCPBUGS-28388. The following is the description of the original issue:
—
Description of problem:
The status controller of CCO reconciles 500+ times/h on average on a resting 6-node mint-mode OCP cluster on AWS.
Steps to Reproduce:
1. Install a 6-node mint-mode OCP cluster on AWS
2. Do nothing with it and wait for a couple of hours
3. Plot the following metric in the metrics dashboard of OCP console:
   rate(controller_runtime_reconcile_total{controller="status"}[1h]) * 3600
Actual results:
500+ reconciles/h on a resting cluster
Expected results:
12-50 reconciles/h on a resting cluster
Note: the reconcile() function always requeues after 5min so the theoretical minimum is 12 reconciles/h
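The arithmetic behind the expected range can be checked with a short sketch. The function names below are hypothetical; the 5-minute requeue interval and the 500 reconciles/h observation come from the report above.

```python
REQUEUE_INTERVAL_MINUTES = 5  # from the report: reconcile() always requeues after 5 min


def min_reconciles_per_hour(requeue_minutes: float) -> float:
    """Lower bound on reconciles per hour when the only trigger is the periodic requeue."""
    return 60 / requeue_minutes


def excess_factor(observed_per_hour: float, requeue_minutes: float) -> float:
    """How many times higher the observed rate is than the periodic-requeue floor."""
    return observed_per_hour / min_reconciles_per_hour(requeue_minutes)


if __name__ == "__main__":
    floor = min_reconciles_per_hour(REQUEUE_INTERVAL_MINUTES)
    print(f"floor: {floor:.0f} reconciles/h")
    print(f"observed 500/h is {excess_factor(500, REQUEUE_INTERVAL_MINUTES):.1f}x the floor")
```

At 500 reconciles/h the controller runs roughly 40 times more often than the periodic requeue alone would explain.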
Description of problem:
After a control plane release upgrade, the 'tuned' pod in the guest cluster uses the control plane release image.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. create a cluster in 4.14.0-0.ci-2023-09-06-180503
2. control plane release upgrade to 4.14-2023-09-07-180503
3. in the guest cluster check container image in pod tuned
Actual results:
pod tuned uses control plane release image 4.14-2023-09-07-180503
Expected results:
pod tuned uses release image 4.14.0-0.ci-2023-09-06-180503
Additional info:
After the control plane release upgrade, cluster-node-tuning-operator in the control plane namespace uses the control plane release image:
jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].name}{"\n"}'
cluster-node-tuning-operator
jiezhao-mac:hypershift jiezhao$ oc get pods cluster-node-tuning-operator-6dc549ffdf-jhj2k -n clusters-jie-test -ojsonpath='{.spec.containers[].image}{"\n"}'
registry.ci.openshift.org/ocp/4.14-2023-09-07-180503@sha256:60bd6e2e8db761fb4b3b9d68c1da16bf0371343e3df8e72e12a2502640173990
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1167
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/201
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2133
Dependabot is not updating dependencies. Investigate & fix.
Description of problem:
With the fix for BZ 2079803 [1] we have introduced a backup trigger on every z-release (instead of every y-release). Sadly we have not updated the CVO [2] logic along with it, which effectively stops the upgrade until a snapshot is taken.
Currently we have a split state machine (thanks Trevor):
... today we have this for minor updates:
1. User bumps ClusterVersion spec asking for a minor update.
2. CVO checks for a recent etcd backup. Until it is available, we refuse to accept the retarget request.
3. Once the etcd backup is available (assuming no other precondition issues), we accept the retarget and start updating.
While for patch updates:
1. User bumps ClusterVersion spec asking for a patch update.
2. CVO accepts the retarget, sets status.desired, and starts in on the update.
In the patch-update case, it might be that the CEO takes a snapshot while the upgrade is already running (race condition). This creates an inconsistent snapshot, which on restore would just re-attempt to execute the (botched) upgrade.
[1] https://github.com/openshift/cluster-etcd-operator/pull/835
[2] https://github.com/openshift/cluster-version-operator/blob/master/pkg/payload/precondition/clusterversion/etcdbackup.go#L76-L77
Version-Release number of selected component (if applicable):
any OCP > 4.10
How reproducible:
almost always (race condition between CEO and CVO)
Steps to Reproduce:
1. trigger a z-upgrade
2. observe when the etcd backup is taken, it might happen after the upgrade is already in progress
Actual results:
The snapshot that was created contains parts of the newly upgraded OCP (CVO CRD or any other operator state).
Expected results:
The snapshot should not contain any information that could come through with the z-upgrade.
Additional info:
Either the CVO should also wait on z-upgrades to ensure the snapshots are consistently on a pre-upgrade state, or we revert the z-stream upgrade behavior again.
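The first option (the CVO also waits on z-upgrades) can be contrasted with the behavior described in this report using a small sketch. This is only an illustrative model of the decision, assuming a hypothetical accepts_retarget function and wait_on_patch flag; it is not the CVO precondition code.

```python
def accepts_retarget(update_kind: str, backup_available: bool, wait_on_patch: bool = False) -> bool:
    """Return True if a retarget request would be accepted.

    update_kind:      "minor" (y-stream) or "patch" (z-stream).
    backup_available: whether a recent etcd backup exists.
    wait_on_patch:    hypothetical switch modelling the proposal that patch
                      updates also wait for the backup.
    """
    if update_kind not in ("minor", "patch"):
        raise ValueError(f"unknown update kind: {update_kind!r}")
    # Reported behavior: only minor updates are gated on the backup.
    needs_backup = update_kind == "minor" or wait_on_patch
    return backup_available or not needs_backup


if __name__ == "__main__":
    # Reported behavior: a patch update starts even though no backup exists yet,
    # so the backup can be taken while the update is already running.
    print(accepts_retarget("patch", backup_available=False))
    # Proposed behavior: the patch update waits, like a minor update does.
    print(accepts_retarget("patch", backup_available=False, wait_on_patch=True))
```

The race in this report corresponds to the first call: the update is accepted before the snapshot exists.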
—
William Caban and our team decided to entirely remove the controller.
W. Trevor King to drop the requirement in CVO.
Michael Burke reviewed the plugin API documentation as part of https://github.com/openshift/openshift-docs/pull/53103. We should update the ts-doc comments in the openshift/console repo based on this review.
Please review the following PR: https://github.com/openshift/cluster-api-provider-aws/pull/477
This is a clone of issue OCPBUGS-25830. The following is the description of the original issue:
—
Component Readiness has found a potential regression in [sig-arch] events should not repeat pathologically for ns/openshift-operator-lifecycle-manager.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.15
Start Time: 2023-12-05T00:00:00Z
End Time: 2023-12-11T23:59:59Z
Success Rate: 94.30%
Successes: 248
Failures: 15
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 730
Failures: 0
Flakes: 0
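The success rates quoted above follow directly from the success and failure counts. A minimal sketch, assuming the rate is successes divided by successes plus failures (flakes are zero in both samples, so how they are counted does not affect these numbers); the function name is hypothetical:

```python
def success_rate(successes: int, failures: int) -> float:
    """Fraction of runs that succeeded, ignoring flakes (zero in both samples here)."""
    total = successes + failures
    if total == 0:
        raise ValueError("no runs in sample")
    return successes / total


if __name__ == "__main__":
    sample = success_rate(248, 15)   # 4.15 sample release
    base = success_rate(730, 0)      # 4.14 base release
    print(f"sample: {sample:.2%}, base: {base:.2%}, drop: {base - sample:.2%}")
```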
This is a clone of issue OCPBUGS-385. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
A recently introduced validation in the hypershift operator's webhook conflicts with the UI's ability to create HCP clusters. Previously, the pull secret did not have to be posted before an HC or NP; with a recent change, it is required because the pull secret is used to validate the release image payload. This issue is isolated to 4.15.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%. Attempt to post an HC before the pull secret is posted, and the HC will be rejected. The expected outcome is that the pull secret can be posted after the HC, and the controller should be eventually consistent with this change.
The cluster-version operator is very chatty, and this can cause problems in clusters where logs are shipped off to external storage. We worked on this in rhbz#2034493, which taught 4.10 and later to move to level 2 logging, mostly to drop the client-side throttling messages. And we have been pushing OTA-923 to make logging tunable, to avoid the need to make "will we want to hear about this?" decisions in one place for all clusters at all times. But there is interest in reducing the amount of logging in older releases in ways that do not require a tunable knob, and this bug tracks another step in that direction: the Running sync / Done syncing messages.
Version-Release number of selected component (if applicable):
All 4.y releases log these lines at high volume, but 4.10 and earlier are end-of-life, and 4.11 and 4.12 are in maintenance mode.
Every time.
1. Install a cluster.
2. Wait at least 30m since install or the most recent update completes, because we want the CVO to be chatty during those exciting times, and this bug is about steady-state log volume.
3. Collect CVO logs for the past 30m: oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --since=40m --tail=-1 >cvo.log.
$ oc adm upgrade
Cluster version is 4.13.21
...
$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --since=40m --tail=-1 > cvo.log
$ grep -o 'apply.*in state.*' cvo.log | uniq -c
     10 apply: 4.13.21 on generation 77 in state Reconciling at attempt 0
$ wc cvo.log
  20043  242930 3071956 cvo.log
$ sed -n 's/^.* \([^ ]*[.]go:[0-9]*\).*/\1/p' cvo.log | sort | uniq -c | sort -n | tail -n5
    194 sync_worker.go:490
    314 sync_worker.go:978
    807 task_graph.go:477
   7971 sync_worker.go:1007
   7973 sync_worker.go:987
$ grep 'sync_worker.go:987' cvo.log | tail -n2
I1116 22:10:08.739999 1 sync_worker.go:987] Running sync for serviceaccount "openshift-cloud-credential-operator/cloud-credential-operator" (271 of 842)
I1116 22:10:08.785081 1 sync_worker.go:987] Running sync for flowschema "openshift-apiserver" (457 of 842)
$ grep 'sync_worker.go:1007' cvo.log | tail -n2
I1116 22:10:08.739967 1 sync_worker.go:1007] Done syncing for configmap "openshift-cloud-credential-operator/cco-trusted-ca" (270 of 842)
I1116 22:10:08.785043 1 sync_worker.go:1007] Done syncing for flowschema "openshift-apiserver-sar" (456 of 842)
So that's 3071956 bytes / 30 minutes * 60 minutes / 1 hour ~= 6 MB / hour, the bulk of which is Running sync and Done syncing logs.
$ grep -v 'sync_worker.go:\(987\|1007\)]' cvo.log | wc
   4099   51602  861709
So something closer to 861709 bytes / 30 minutes * 60 minutes / 1 hour ~= 2 MB / hour would be acceptable.
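The two hourly estimates can be reproduced from the byte counts reported by wc. The sketch below uses the 30-minute window from the calculation above; the function and variable names are hypothetical.

```python
def bytes_per_hour(size_bytes: int, window_minutes: int) -> float:
    """Scale a log size collected over window_minutes to an hourly rate."""
    return size_bytes * 60 / window_minutes


TOTAL_BYTES = 3071956         # wc output for the full cvo.log
WITHOUT_SYNC_BYTES = 861709   # wc output after filtering Running sync / Done syncing lines
WINDOW_MINUTES = 30           # window assumed in the calculation above

total_rate = bytes_per_hour(TOTAL_BYTES, WINDOW_MINUTES)
reduced_rate = bytes_per_hour(WITHOUT_SYNC_BYTES, WINDOW_MINUTES)
# Share of the log volume taken up by the two message types.
sync_share = (TOTAL_BYTES - WITHOUT_SYNC_BYTES) / TOTAL_BYTES

if __name__ == "__main__":
    print(f"current:  {total_rate / 1e6:.1f} MB/hour")
    print(f"filtered: {reduced_rate / 1e6:.1f} MB/hour")
    print(f"sync messages: {sync_share:.0%} of bytes")
```

By this estimate the two message types account for roughly 72% of the logged bytes.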
The CVO has a randomized sleep to cool off between sync cycles, and per-sync-cycle log volume will depend on (among other things) what that CVO container happened to choose for that sleep.
Please review the following PR: https://github.com/openshift/alibaba-cloud-csi-driver/pull/42
This is a clone of issue OCPBUGS-29645. The following is the description of the original issue:
—
Description of problem:
When a customer certificate and sre certificate are configured and approved, revocation of customer certificate causes access to the cluster using kubeconfig with sre cert to be denied
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster
2. Configure a customer cert and an SRE cert; both are approved
3. Revoke the customer cert; access to the cluster using a kubeconfig with the SRE cert is denied
Actual results:
After the customer cert is revoked, access to the cluster using a kubeconfig with the SRE cert is denied
Expected results:
After the customer cert is revoked, access to the cluster using a kubeconfig with the SRE cert still succeeds
Additional info:
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/970
Description of problem:
In install-config.yaml, set the controlplane type to a size in VM family standardEIBDSv5Family or standardEIBSv5Family. The installer returns the error below when creating the cluster:
09-07 17:55:57.613 level=error msg=Error: creating Linux Virtual Machine: (Name "jima-test-wlgrr-bootstrap" / Resource Group "jima-test-wlgrr-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="InvalidParameter" Message="The VM size 'Standard_E112ibs_v5' cannot boot with OS image or disk. Please check that disk controller types supported by the OS image or disk is one of the supported disk controller types for the VM size 'Standard_E112ibs_v5'. Please query sku api at https://aka.ms/azure-compute-skus to determine supported disk controller types for the VM size." Target="vmSize"
Checked that both VM families only support diskControllerTypes NVMe:
{ "name": "DiskControllerTypes", "value": "NVMe" },
According to https://github.com/hashicorp/terraform-provider-azurerm/issues/22058, it seems that setting disk controller types is not supported. Suggest adding validation for those families, as was done in https://github.com/openshift/installer/pull/6733
Version-Release number of selected component (if applicable):
4.14 nightly build
How reproducible:
always
Steps to Reproduce:
1. prepare install-config, set vm size in family standardEIBDSv5Family and standardEIBSv5Family for controlplane
2. create cluster
Actual results:
Installer failed with error
Expected results:
The installer should pre-check for those unsupported instance types and exit with an error message
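A pre-check of this kind could inspect the DiskControllerTypes capability shown in the description. The sketch below is a hypothetical illustration, not the installer's validation: the function name is invented, the capability list mirrors the JSON fragment above, and it assumes that a size without the capability falls back to SCSI.

```python
from typing import Dict, List


def lacks_disk_controller(capabilities: List[Dict[str, str]], required: str = "SCSI") -> bool:
    """Return True if the VM size advertises disk controller types that exclude `required`.

    `capabilities` is a list of {"name": ..., "value": ...} entries, shaped like
    the SKU capability fragment quoted in the report.
    """
    for cap in capabilities:
        if cap.get("name") == "DiskControllerTypes":
            supported = [v.strip() for v in cap.get("value", "").split(",") if v.strip()]
            return required not in supported
    # Assumption: no DiskControllerTypes capability means the default (SCSI) applies.
    return False


if __name__ == "__main__":
    nvme_only = [{"name": "DiskControllerTypes", "value": "NVMe"}]
    if lacks_disk_controller(nvme_only):
        print("instance type rejected: only NVMe disk controllers are supported")
```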
Additional info:
This is a clone of issue OCPBUGS-26513. The following is the description of the original issue:
—
Description of problem:
oc-mirror with v2 creates the IDMS file as output, but the source looks like:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - ec2-3-144-29-184.us-east-2.compute.amazonaws.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
status: {}
The source should always be the origin registry, like quay.io/openshift-release-dev
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. run the command with v2:
   apiVersion: mirror.openshift.io/v1alpha2
   kind: ImageSetConfiguration
   mirror:
     platform:
       channels:
       - name: stable-4.14
         minVersion: 4.14.3
         maxVersion: 4.14.3
       graph: true
   `oc-mirror --config config.yaml file://out --v2`
   `oc-mirror --config config.yaml --from file://out --v2 docker://xxxx:5000/ocp2`
2. check the idms file
Actual results:
2. cat idms-2024-01-08t04-19-04z.yaml
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  creationTimestamp: null
  name: idms-2024-01-08t04-19-04z
spec:
  imageDigestMirrors:
  - mirrors:
    - xxxx.com:5000/ocp2/openshift
    source: localhost:55000/openshift
  - mirrors:
    - xxxx.com:5000/ocp2/openshift-release-dev
    source: quay.io/openshift-release-dev
Expected results:
The source should not be localhost:55000; it should be the origin registry.
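A check for this symptom can be sketched as a scan of the generated ImageDigestMirrorSet for sources that point at a local registry. The helper name below is hypothetical, and the manifest is passed as an already-parsed dict shaped like the YAML above, so no YAML library is needed.

```python
from typing import Any, Dict, List

LOCAL_HOSTS = ("localhost", "127.0.0.1")


def local_sources(idms: Dict[str, Any]) -> List[str]:
    """Return every imageDigestMirrors source whose registry host is local."""
    flagged = []
    for entry in idms.get("spec", {}).get("imageDigestMirrors", []):
        source = entry.get("source", "")
        # "localhost:55000/openshift" -> registry "localhost:55000" -> host "localhost"
        host = source.split("/", 1)[0].split(":", 1)[0]
        if host in LOCAL_HOSTS:
            flagged.append(source)
    return flagged


idms = {
    "apiVersion": "config.openshift.io/v1",
    "kind": "ImageDigestMirrorSet",
    "spec": {
        "imageDigestMirrors": [
            {"mirrors": ["xxxx.com:5000/ocp2/openshift"],
             "source": "localhost:55000/openshift"},
            {"mirrors": ["xxxx.com:5000/ocp2/openshift-release-dev"],
             "source": "quay.io/openshift-release-dev"},
        ]
    },
}

if __name__ == "__main__":
    print(local_sources(idms))
```

For the manifest in this report, only the localhost:55000/openshift entry is flagged.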
Additional info:
Please review the following PR: https://github.com/openshift/cluster-machine-approver/pull/217
Update the owners file in the openshift-state-metric repository: add new teammates and remove former teammates.
Please review the following PR: https://github.com/openshift/cluster-version-operator/pull/1000
A case was found recently (see https://github.com/openshift/machine-os-images/pull/27) where the rhcos image version stored within the machine-os-images was different than the one reported in the installer rhcos metadata.
This sync is particularly relevant for the agent-based installer, since the create image command logic can fetch the base ISO either from the machine-os-images content or from a direct download, depending on whether the oc command is available in the current execution environment.
Even though this scenario is very unlikely to happen in production, a missing sync between machine-os-images and the installer metadata may produce different results depending on the environment and, moreover, can silently hide severe issues.
Description of problem:
tested https://github.com/openshift/console/pull/13114 with cluster-bot
launch 4.15,openshift/console#13114 gcp
the below functions are unavailable, see recording: https://drive.google.com/file/d/1yBS_xGWgJwfIoOdLdIjZ6riSL_cOARrb/view?usp=sharing
1. time interval drop-down
2. Actions drop-down:
Add query
Collapse all query tables
3. Add query button
4. kebab menu:
Disable query
Delete query
Duplicate query
5. disable/enable query toggle button
NOTE: also checked on 4.15.0-0.nightly-arm64-2023-09-19-235618, no such issues
Version-Release number of selected component (if applicable):
test https://github.com/openshift/console/pull/13114 with cluster-bot
How reproducible:
always
Steps to Reproduce:
1. regression testing for console PR 13114 2. 3.
Actual results:
console PR 13114 makes many functions under "Observe > Metrics" unavailable
Expected results:
no issue
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/302
Description of problem:
Image pulls fail with http status 504, gateway timeout until image registry pods are restarted.
Version-Release number of selected component (if applicable):
4.13.12
How reproducible:
Intermittent
Steps to Reproduce:
1. 2. 3.
Actual results:
Images can't be pulled:
podman pull registry.ci.openshift.org/ci/applyconfig:latest
Trying to pull registry.ci.openshift.org/ci/applyconfig:latest...
Getting image source signatures
Error: reading signatures: downloading signatures for sha256:83c1b636069c3302f5ba5075ceeca5c4a271767900fee06b919efc3c8fa14984 in registry.ci.openshift.org/ci/applyconfig: received unexpected HTTP status: 504 Gateway Time-out
Image registry pods contain errors:
time="2023-09-01T02:25:39.596485238Z" level=warning msg="error authorizing context: access denied" go.version="go1.19.10 X:strictfipsruntime" http.request.host=registry.ci.openshift.org http.request.id=3e805818-515d-443f-8d9b-04667986611d http.request.method=GET http.request.remoteaddr=18.218.67.82 http.request.uri="/v2/ocp/4-dev-preview/manifests/sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0" http.request.useragent="containers/5.24.1 (github.com/containers/image)" vars.name=ocp/4-dev-preview vars.reference="sha256:caf073ce29232978c331d421c06ca5c2736ce5461962775fdd760b05fb2496a0"
Expected results:
Image registry does not return gateway timeouts
Additional info:
Must gather(s) attached, additional information in linked OHSS ticket.
Issue 29 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
The Form view and YAML view switches were previously aligned horizontally; now they are stacked vertically.
This happens at least on
Screenshot: https://drive.google.com/file/d/1nzFHCeorlVIMbwlnjzEc1fCW0GXQa1KT/view
This is a clone of issue OCPBUGS-25687. The following is the description of the original issue:
—
Description of problem:
The [Jira:"Network / ovn-kubernetes"] monitor test pod-network-avalibility setup test is a frequent offender in the OpenStack CSI jobs. We're seeing it fail on 4.14 up to 4.16.
Example of failed job.
Example of successful job.
It seems like the 1 min timeout is too short and does not give enough time for the pods backing the service to come up.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
This fix contains the following changes coming from updated version of kubernetes up to v1.28.4:
Changelog:
v1.28.4: https://github.com/kubernetes/kubernetes/blob/release-1.28/CHANGELOG/CHANGELOG-1.28.md#changelog-since-v1283
Description of problem:
CredentialsRequest for Azure AD workload identity contains unnecessary permissions under `virtualMachines/extensions`. Specifically write and delete.
Version-Release number of selected component (if applicable):
4.14.0+
How reproducible:
Every time
Steps to Reproduce:
1. Create a cluster without the CredentialsRequest permissions mentioned
2. Scale machineset
3. See no permission errors
Actual results:
We have unnecessary permissions, but still no errors
Expected results:
Still no permission errors after removal.
Additional info:
RHCOS doesn't leverage virtual machine extensions. It appears as though the code path is dead.
This is a clone of issue OCPBUGS-27446. The following is the description of the original issue:
—
Steps to Reproduce:
1. Install a cluster using Azure Workload Identity
2. Check the value of the cco_credentials_mode metric
Actual results:
mode = manual
Expected results:
mode = manualpodidentity
Additional info:
The cco_credentials_mode metric reports manualpodidentity mode for an AWS STS cluster.
This is a clone of issue OCPBUGS-29249. The following is the description of the original issue:
—
Observed during testing of candidate-4.15 image as of 2024-02-08.
This is an incomplete report as I haven't verified the reproducer yet or attempted to get a must-gather. I have observed this multiple times now, so I am confident it's a thing. I can't be confident that the procedure described here reliably reproduces it, or that all the described steps are required.
I have been using MCO to apply machine config to masters. This involves a rolling reboot of all masters.
During a rolling reboot I applied an update to CPMS. I observed the following sequence of events:
At this point there were only 2 nodes in the cluster:
and machines provisioning:
Description of problem:
When we encounter the HostAlreadyClaimed issue, the error message is pointing to the wrong route name.
Version-Release number of selected component (if applicable):
OCP v4.12.z
How reproducible:
Frequently
Steps to Reproduce:
- Created three routes with similar hosts, one without a path and the others with paths defined.
# oc get routes
NAME     HOST/PORT                                                                        PATH    SERVICES        PORT   TERMINATION   WILDCARD
route1   httpd-example-path-based-routes.apps.firstcluster.lab.upshift.rdu2.redhat.com            httpd-example   web    edge          None
route2   httpd-example-path-based-routes.apps.firstcluster.lab.upshift.rdu2.redhat.com   /path   httpd-example   web    edge          None
route3   HostAlreadyClaimed                                                               /path   httpd-example   web    edge          None   <---------------
- Got the 'HostAlreadyClaimed' error for the third route, 'route3', which is expected because the path and the hostname of 'route2' and 'route3' are the same.
- In the route description, the first route, 'route1', is reported to be the older route for the host, but we expect it to report 'route2' because the hostname and path are the same for route2 and route3.
# oc describe route route3
Name:             route3
Namespace:        path-based-routes
Created:          14 seconds ago
Labels:           app=httpd-example
                  template=httpd-example
Annotations:      <none>
Requested Host:   httpd-example-path-based-routes.apps.firstcluster.lab.upshift.rdu2.redhat.com
                  rejected by router default: (host router-default.apps.firstcluster.lab.upshift.rdu2.redhat.com)HostAlreadyClaimed (14 seconds ago)
                  route route1 already exposes httpd-example-path-based-routes.apps.firstcluster.lab.upshift.rdu2.redhat.com and is older <----------------
Path:             /path
TLS Termination:  edge
Insecure Policy:  <none>
Endpoint Port:    web
Service:          httpd-example
Weight:           100 (100%)
Endpoints:        10.1.2.3:8080
- However, deleting 'route2' resolves the issue.
Actual results:
Error messages for the 'HostAlreadyClaimed' issue should report the route name on the basis of both hostname and path.
Expected results:
Only the hostname is taken into consideration; the route's path should be checked as well, and then the appropriate route name should be reported in the error.
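The expected behaviour can be sketched as follows. This is a minimal illustration, not the router's actual code: the `Route` type, its `created` field, and the `find_claiming_route` helper are hypothetical names, and the rule shown (match on host and path, oldest wins) is an assumption drawn from the report above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    name: str
    host: str
    path: str      # "" means the route has no path
    created: int   # hypothetical creation order; lower is older


def find_claiming_route(new: Route, existing: List[Route]) -> Optional[Route]:
    """Return the oldest existing route with the same host AND path as `new`."""
    conflicts = [
        r for r in existing
        if r.host == new.host and r.path == new.path and r.created < new.created
    ]
    return min(conflicts, key=lambda r: r.created, default=None)


host = "httpd-example-path-based-routes.apps.firstcluster.lab.upshift.rdu2.redhat.com"
route1 = Route("route1", host, "", 1)
route2 = Route("route2", host, "/path", 2)
route3 = Route("route3", host, "/path", 3)

claimer = find_claiming_route(route3, [route1, route2])
print(claimer.name)  # route2
```

Matching on the host alone would pick 'route1' here, which is the misleading name the report describes.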
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/274
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26772. The following is the description of the original issue:
—
Description of problem:
When cloning a PVC of 60GiB size, the system autofills the remote size to be 8192 PeB. This size cannot be changed in the UI before starting the clone.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1. Create a VM with a 60 GiB PVC.
2. Power off the VM.
3. As a cluster admin, clone the 60 GiB PVC (Storage -> PersistentVolumeClaims -> kebab menu next to the PVC).
Actual results:
The system tries to clone the 60 GiB PVC as an 8192 PeB PVC.
Expected results:
A new 60 GiB PVC.
Additional info:
This seems like the closed BZ 2177979. I will upload a screenshot of the UI.
Here is the YAML for the original PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    cdi.kubevirt.io/storage.bind.immediate.requested: "true"
    cdi.kubevirt.io/storage.contentType: kubevirt
    cdi.kubevirt.io/storage.pod.phase: Succeeded
    cdi.kubevirt.io/storage.populator.progress: 100.0%
    cdi.kubevirt.io/storage.preallocation.requested: "false"
    cdi.kubevirt.io/storage.usePopulator: "true"
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-05T17:34:19Z"
  finalizers:
  - kubernetes.io/pvc-protection
  - provisioner.storage.kubernetes.io/cloning-protection
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    kubevirt.io/created-by: 60f46f91-2db3-4118-aaba-b1697b29c496
  name: win2k19-base
  namespace: base-images
  ownerReferences:
  - apiVersion: cdi.kubevirt.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: DataVolume
    name: win2k19-base
    uid: 8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resourceVersion: "697047"
  uid: fccb0aa9-8541-4b51-b49e-ddceaa22b68c
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  dataSourceRef:
    apiGroup: cdi.kubevirt.io
    kind: VolumeImportSource
    name: volume-import-source-8980e7b7-ce0b-47b4-a7e4-f4c79e984ebe
  resources:
    requests:
      storage: "64424509440"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
  volumeName: pvc-dbfc9fe9-5677-469d-9402-c2f3a22dab3f
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 60Gi
  phase: Bound
Here is the YAML for the cloning PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-12-06T14:24:07Z"
  finalizers:
  - kubernetes.io/pvc-protection
  name: win2k19-base-clone
  namespace: base-images
  resourceVersion: "1551054"
  uid: f72665c3-6408-4129-82a2-e663d8ecc0cc
spec:
  accessModes:
  - ReadWriteMany
  dataSource:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  dataSourceRef:
    apiGroup: ""
    kind: PersistentVolumeClaim
    name: win2k19-base
  resources:
    requests:
      storage: "9223372036854775807"
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Block
status:
  phase: Pending
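The two storage requests in the YAML above can be checked with a little arithmetic. The sketch below only converts the byte counts from the report. It shows that the clone's request is exactly the largest signed 64-bit integer, which is where the 8192 figure comes from. Why that value ends up in the request is not stated in the report, so nothing here should be read as the root cause.

```python
PIB = 2 ** 50  # bytes in one pebibyte
GIB = 2 ** 30  # bytes in one gibibyte

original_request = 64424509440       # storage request of the source PVC
clone_request = 9223372036854775807  # storage request of the cloning PVC

print(original_request / GIB)        # 60.0
print(clone_request == 2 ** 63 - 1)  # True: largest signed 64-bit integer
print(round(clone_request / PIB))    # 8192
```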
Description of problem:
Creating the installation manifests results in a bogus warning message about discarding existing manifests, even though none exist.
Version-Release number of selected component (if applicable):
Tested on 4.15 dev, but the problem appears to have been present since 4.2.
How reproducible:
100%
Steps to Reproduce:
1. Start with an empty dir containing only an install-config.yaml with platform: baremetal
2. Run "openshift-install create manifests"
3. There is no step 3
Actual results:
INFO Consuming Install Config from target directory
WARNING Discarding the Openshift Manifests that was provided in the target directory because its dependencies are dirty and it needs to be regenerated
INFO Manifests created in: test/manifests and test/openshift
Expected results:
INFO Consuming Install Config from target directory
INFO Manifests created in: test/manifests and test/openshift
Additional info:
The issue is due to multiple assets referencing the same files.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/117
The prometheus-operator-admission-webhook logs show "tls: bad certificate" errors for connections from the kube-apiserver operator. For example, see https://reportportal-openshift.apps.ocp-c1.prod.psi.redhat.com/ui/#prow/launches/all/470214 and its must-gather: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-aws-ipi-imdsv2-fips-f14/1726036030588456960/artifacts/aws-ipi-imdsv2-fips-f14/gather-must-gather/artifacts/
MacBook-Pro:~ jianzhang$ omg logs prometheus-operator-admission-webhook-6bbdbc47df-jd5mb | grep "TLS handshake"
2023-11-27 10:11:50.687 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-11-19T00:57:08.318983249Z ts=2023-11-19T00:57:08.318923708Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48334: remote error: tls: bad certificate"
2023-11-19T00:57:10.336569986Z ts=2023-11-19T00:57:10.336505695Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.129.0.35:48342: remote error: tls: bad certificate"
...
MacBook-Pro:~ jianzhang$ omg get pods -A -o wide | grep "10.129.0.35"
2023-11-27 10:12:16.382 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
openshift-kube-apiserver-operator kube-apiserver-operator-f78c754f9-rbhw9 1/1 Running 2 5h27m 10.129.0.35 ip-10-0-107-238.ec2.internal
For more information, see Slack: https://redhat-internal.slack.com/archives/CC3CZCQHM/p1700473278471309
This is a clone of issue OCPBUGS-23199. The following is the description of the original issue:
—
Description of problem:
During a pod deletion, the whereabouts reconciler correctly detects the pod deletion but errors out, claiming that the IPPool is not found. However, the audit logs show no deletion and no re-creation, and they even show successful "patch" and "get" requests to the same IPPool. This means that the IPPool was never deleted and was accessible at the time of the issue, so it looks like the reconciler made a mistake while retrieving the IPPool.
Version-Release number of selected component (if applicable):
4.12.22
How reproducible:
Sometimes
Steps to Reproduce:
1. Delete a pod
Actual results:
Error in the whereabouts reconciler. New pods using additional networks with the whereabouts IPAM plugin cannot have IPs allocated due to wrong cleanup.
Expected results:
Additional info:
Description of problem:
According to https://cloud.google.com/docs/authentication/provide-credentials-adc#local-key, the default way to provide application credentials is to set GOOGLE_APPLICATION_CREDENTIALS. Currently this variable is missing from the list of environment variables checked.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
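A minimal sketch of the kind of lookup this report asks for is shown below. Only GOOGLE_APPLICATION_CREDENTIALS is named in the report. The other variable names, their order, and the `first_credential_env` helper are illustrative assumptions, not the component's actual list or API.

```python
import os
from typing import Mapping, Optional, Sequence, Tuple

# Only GOOGLE_APPLICATION_CREDENTIALS comes from the report; the other
# names are assumptions about what a tool might already check.
CREDENTIAL_ENV_VARS = (
    "GOOGLE_CREDENTIALS",
    "GOOGLE_CLOUD_KEYFILE_JSON",
    "GCLOUD_KEYFILE_JSON",
    "GOOGLE_APPLICATION_CREDENTIALS",
)


def first_credential_env(
    env: Mapping[str, str],
    names: Sequence[str] = CREDENTIAL_ENV_VARS,
) -> Optional[Tuple[str, str]]:
    """Return (name, value) of the first non-empty credential variable, or None."""
    for name in names:
        value = env.get(name, "")
        if value:
            return name, value
    return None


# Inspect the current process environment; the result depends on the machine.
print(first_credential_env(os.environ))
```

Passing the environment in as a mapping keeps the lookup easy to test without touching the real process environment.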
When a user selects a supported-but-not-recommended update target, it's currently rendered as a DropdownWithSwitch that is collapsed by default. That forces the user to perform an extra click to see the message explaining the risk they are considering accepting. We should remove the toggle and always expand that message, because understanding the risk is a critical part of deciding whether you accept it.
Since console landed support for conditional update risks. Not a big enough deal to backport that whole way.
Every time.
OTA-520 explains how to create dummy data for testing the conditional update UX pre-merge and/or on nightly builds that are not part of the usual channels yet.
but without the down-v, because the text should not be collapsible.
When creating a cluster on an existing vnet on MAG and ASH, the installer failed with the following error:
11-27 13:42:03.944 level=info msg=Creating infrastructure resources...
11-27 13:42:04.502 level=fatal msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to get the virtual network "jima27maga-vnet": GET https://management.azure.com/subscriptions/8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7/resourceGroups/jima27maga-rg/providers/Microsoft.Network/virtualNetworks/jima27maga-vnet
11-27 13:42:04.503 level=fatal msg=--------------------------------------------------------------------------------
11-27 13:42:04.503 level=fatal msg=RESPONSE 404: 404 Not Found
11-27 13:42:04.503 level=fatal msg=ERROR CODE: SubscriptionNotFound
11-27 13:42:04.503 level=fatal msg=--------------------------------------------------------------------------------
11-27 13:42:04.503 level=fatal msg={
11-27 13:42:04.503 level=fatal msg= "error": {
11-27 13:42:04.503 level=fatal msg= "code": "SubscriptionNotFound",
11-27 13:42:04.503 level=fatal msg= "message": "The subscription '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7' could not be found."
11-27 13:42:04.504 level=fatal msg= }
11-27 13:42:04.504 level=fatal msg=}
11-27 13:42:04.504 level=fatal msg=--------------------------------------------------------------------------------
11-27 13:42:04.504 level=fatal
While destroying the cluster, the following error occurred when removing shared tags:
$ ./openshift-install destroy cluster --dir ipi --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-11-25-110147
DEBUG Built from commit 1ea1a54a197501cdbda71196c7fac744f835217f
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal_gov.json"
DEBUG deleting public records
WARNING no DNS records found: either they were already deleted or the service principal lacks permissions to list them
DEBUG deleting resource group
INFO deleted resource group=jima761122c-264bb-rg
DEBUG deleting application registrations
DEBUG failed to query resources with shared tag: POST https://management.azure.com/providers/Microsoft.ResourceGraph/resources
DEBUG --------------------------------------------------------------------------------
DEBUG RESPONSE 400: 400 Bad Request
DEBUG ERROR CODE: BadRequest
DEBUG --------------------------------------------------------------------------------
DEBUG {
DEBUG "error": {
DEBUG "code": "BadRequest",
DEBUG "message": "Please provide below info when asking for support: timestamp = 2023-11-27T06:25:26.3355852Z, correlationId = b4dfd555-86b0-4e68-aec7-f75cd7307c69.",
DEBUG "details": [
DEBUG {
DEBUG "code": "NoValidSubscriptionsInQueryRequest",
DEBUG "message": "There must be at least one subscription that is eligible to contain resources. Given: '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7'."
DEBUG }
DEBUG ]
DEBUG }
DEBUG }
DEBUG --------------------------------------------------------------------------------
DEBUG
FATAL Failed to destroy cluster: failed to remove shared tags: failed to query resources with shared tag: POST https://management.azure.com/providers/Microsoft.ResourceGraph/resources
FATAL --------------------------------------------------------------------------------
FATAL RESPONSE 400: 400 Bad Request
FATAL ERROR CODE: BadRequest
FATAL --------------------------------------------------------------------------------
FATAL {
FATAL "error": {
FATAL "code": "BadRequest",
FATAL "message": "Please provide below info when asking for support: timestamp = 2023-11-27T06:25:26.3355852Z, correlationId = b4dfd555-86b0-4e68-aec7-f75cd7307c69.",
FATAL "details": [
FATAL {
FATAL "code": "NoValidSubscriptionsInQueryRequest",
FATAL "message": "There must be at least one subscription that is eligible to contain resources. Given: '8fe0c1b4-8b05-4ef7-8129-7cf5680f27e7'."
FATAL }
FATAL ]
FATAL }
FATAL }
FATAL --------------------------------------------------------------------------------
FATAL
The issue appears to have been introduced by https://github.com/openshift/installer/pull/7611/. Since all accepted 4.15 nightly builds contain PR#7611, it cannot be verified on previous payloads; however, according to Prow CI jobs, installation succeeded with 4.15.0-0.nightly-2023-11-20-045323.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-25-110147
How reproducible:
always
Steps to Reproduce:
1. Install cluster on existing vnet on MAG and ASH
Actual results:
Installation failed.
Expected results:
Installation succeeded.
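In both logs above, the failing requests go to management.azure.com, which is the public-cloud Resource Manager endpoint. Assuming that is why a government or Azure Stack Hub subscription cannot be found (the report does not confirm this), the sketch below shows one way to pick the endpoint from the cloud name. The `arm_endpoint` helper and the cloud-name strings are illustrative assumptions, not the installer's code.

```python
from typing import Optional

# Cloud names used as dictionary keys here are assumptions for illustration.
ARM_ENDPOINTS = {
    "AzurePublicCloud": "https://management.azure.com",
    "AzureUSGovernmentCloud": "https://management.usgovcloudapi.net",
}


def arm_endpoint(cloud_name: str, stack_endpoint: Optional[str] = None) -> str:
    """Return the Resource Manager base URL for the given cloud."""
    if cloud_name == "AzureStackCloud":
        # Azure Stack Hub endpoints are deployment-specific, so the caller
        # has to supply one.
        if not stack_endpoint:
            raise ValueError("Azure Stack Hub requires an explicit ARM endpoint")
        return stack_endpoint.rstrip("/")
    if cloud_name not in ARM_ENDPOINTS:
        raise ValueError(f"unknown cloud: {cloud_name!r}")
    return ARM_ENDPOINTS[cloud_name]


print(arm_endpoint("AzureUSGovernmentCloud"))
# https://management.usgovcloudapi.net
```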
Please review the following PR: https://github.com/openshift/sdn/pull/592
Please review the following PR: https://github.com/openshift/installer/pull/7816
Description of problem:
Build timing test is failing due to faster run times on Bare Metal
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Run [sig-builds][Feature:Builds][timing] capture build stages and durations should record build stages and durations for docker
Actual results:
{ fail [github.com/openshift/origin/test/extended/builds/build_timing.go:101]: Stage PushImage ran for 95, expected greater than 100ms Expected <bool>: true to be false Ginkgo exit error 1: exit with code 1}
Expected results:
Test should pass
Additional info:
Please review the following PR: https://github.com/openshift/ironic-image/pull/438
The host doesn't power off upon removal during scale down.
Version: 4.4.0-0.nightly-2020-01-09-013524
Steps to reproduce:
Starting with 3 workers:
[kni@worker-2 ~]$ oc get bmh -n openshift-machine-api
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
openshift-master-0 OK externally provisioned ocp-edge-cluster-master-0 ipmi://192.168.123.1:6230 true
openshift-master-1 OK externally provisioned ocp-edge-cluster-master-1 ipmi://192.168.123.1:6231 true
openshift-master-2 OK externally provisioned ocp-edge-cluster-master-2 ipmi://192.168.123.1:6232 true
openshift-worker-0 OK provisioned ocp-edge-cluster-worker-0-d2fvm ipmi://192.168.123.1:6233 unknown true
openshift-worker-5 OK provisioned ocp-edge-cluster-worker-0-ptklp ipmi://192.168.123.1:6245 unknown true
openshift-worker-9 OK provisioned ocp-edge-cluster-worker-0-jb2tm ipmi://192.168.123.1:6239 unknown true
[kni@worker-2 ~]$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
ocp-edge-cluster-master-0 4d4h
ocp-edge-cluster-master-1 4d4h
ocp-edge-cluster-master-2 4d4h
ocp-edge-cluster-worker-0-d2fvm 146m
ocp-edge-cluster-worker-0-jb2tm 11m
ocp-edge-cluster-worker-0-ptklp 3h54m
[kni@worker-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
master-0 Ready master 4d4h v0.0.0-master+$Format:%h$
master-1 Ready master 4d4h v0.0.0-master+$Format:%h$
master-2 Ready master 4d4h v0.0.0-master+$Format:%h$
worker-0 Ready worker 18m v0.0.0-master+$Format:%h$
worker-5 Ready worker 18m v0.0.0-master+$Format:%h$
worker-9 Ready worker 5m2s v0.0.0-master+$Format:%h$
Adding an annotation to mark the proper machine for deletion:
oc annotate machine ocp-edge-cluster-worker-0-jb2tm machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-worker-0-jb2tm annotated
Deleting the bmh:
[kni@worker-2 ~]$ oc delete bmh openshift-worker-9 -n openshift-machine-api
baremetalhost.metal3.io "openshift-worker-9" deleted
Scaling down the replicas number:
[kni@worker-2 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-worker-0 --replicas=2
machineset.machine.openshift.io/ocp-edge-cluster-worker-0 scaled
The entry (worker-9) got removed as expected:
[kni@worker-2 ~]$ oc get node
NAME STATUS ROLES AGE VERSION
master-0 Ready master 4d4h v0.0.0-master+$Format:%h$
master-1 Ready master 4d4h v0.0.0-master+$Format:%h$
master-2 Ready master 4d4h v0.0.0-master+$Format:%h$
worker-0 Ready worker 28m v0.0.0-master+$Format:%h$
worker-5 Ready worker 28m v0.0.0-master+$Format:%h$
[kni@worker-2 ~]$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
ocp-edge-cluster-master-0 4d4h
ocp-edge-cluster-master-1 4d4h
ocp-edge-cluster-master-2 4d4h
ocp-edge-cluster-worker-0-d2fvm 156m
ocp-edge-cluster-worker-0-ptklp 4h5m
[kni@worker-2 ~]$ oc get bmh -n openshift-machine-api
NAME STATUS PROVISIONING STATUS CONSUMER BMC HARDWARE PROFILE ONLINE ERROR
openshift-master-0 OK externally provisioned ocp-edge-cluster-master-0 ipmi://192.168.123.1:6230 true
openshift-master-1 OK externally provisioned ocp-edge-cluster-master-1 ipmi://192.168.123.1:6231 true
openshift-master-2 OK externally provisioned ocp-edge-cluster-master-2 ipmi://192.168.123.1:6232 true
openshift-worker-0 OK provisioned ocp-edge-cluster-worker-0-d2fvm ipmi://192.168.123.1:6233 unknown true
openshift-worker-5 OK provisioned ocp-edge-cluster-worker-0-ptklp ipmi://192.168.123.1:6245 unknown true
Yet, if I try to connect to the node that got deleted, it is still up and running.
Expected result:
The removed node should have been powered off automatically.
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/278
Description of problem:
A user destroying a HostedCluster can cause the HostedCluster to hang indefinitely if the destroy command times out during execution. This is due to the hcp CLI placing a finalizer on the HostedCluster during deletion, which the CLI later removes after waiting for some cleanup actions to occur. If a user cancels the `hcp destroy cluster` command (or the command times out) while the CLI is waiting for cleanup, the HostedCluster hangs indefinitely with a DeletionTimestamp != nil. The CLI tool should not put the HostedCluster into an un-reconcilable state. All of this finalizer cleanup logic belongs on the backend.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create an hcp cluster
2. Destroy the hcp cluster with the CLI tool and immediately abort the CLI process
Actual results:
HostedCluster is stuck indefinitely during deletion
Expected results:
HostedCluster is able to delete despite the cli being cancelled.
Additional info:
related to https://access.redhat.com/support/cases/#/case/03660218
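The hang described above follows from how finalizers gate deletion in Kubernetes: a deleted object only gets a deletion timestamp, and it is not actually removed until its finalizer list is empty. The toy model below illustrates that rule. The `Store` class and the finalizer name are hypothetical and do not reflect the hcp CLI or HyperShift code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Obj:
    name: str
    finalizers: List[str] = field(default_factory=list)
    deletion_timestamp: Optional[str] = None


class Store:
    """Toy API server that mimics finalizer-gated deletion."""

    def __init__(self) -> None:
        self.objects: Dict[str, Obj] = {}

    def create(self, obj: Obj) -> None:
        self.objects[obj.name] = obj

    def delete(self, name: str, now: str = "t0") -> None:
        obj = self.objects[name]
        obj.deletion_timestamp = obj.deletion_timestamp or now
        self._collect(name)

    def remove_finalizer(self, name: str, finalizer: str) -> None:
        obj = self.objects[name]
        if finalizer in obj.finalizers:
            obj.finalizers.remove(finalizer)
        self._collect(name)

    def _collect(self, name: str) -> None:
        # An object is removed only once it is marked for deletion AND
        # has no finalizers left.
        obj = self.objects[name]
        if obj.deletion_timestamp is not None and not obj.finalizers:
            del self.objects[name]


store = Store()
store.create(Obj("hc", finalizers=["example.test/cli-cleanup"]))  # hypothetical finalizer
store.delete("hc")
stuck = "hc" in store.objects   # the client went away before removing its finalizer
store.remove_finalizer("hc", "example.test/cli-cleanup")
gone = "hc" not in store.objects
print(stuck, gone)  # True True
```

If the only party that knows how to remove the finalizer is a client process that can be interrupted, the object stays in the "stuck" state, which is why the report argues for moving the cleanup to the backend.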
Description of problem:
After upgrading from OpenShift 4.13 to 4.14 with Kuryr network type, the network operator shows as Degraded and the cluster version reports that it's unable to apply the 4.14 update. The issue seems to be related to mtu settings, as indicated by the message: "Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]."
Version-Release number of selected component (if applicable):
Upgrading from 4.13 to 4.14
4.14.0-0.nightly-2023-09-15-233408
Kuryr network type
RHOS-17.1-RHEL-9-20230907.n.1
How reproducible:
Consistently reproducible on attempting to upgrade from 4.13 to 4.14.
Steps to Reproduce:
1. Install OpenShift version 4.13 on OpenStack.
2. Initiate an upgrade to OpenShift version 4.14.
Actual results:
The network operator shows as Degraded with the message:
network 4.13.13 True False True 13h Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
Additionally, "oc get clusterversions" shows:
Unable to apply 4.14.0-0.nightly-2023-09-15-233408: wait has exceeded 40 minutes for these operators: network
Expected results:
The upgrade should complete successfully without any operator being degraded.
Additional info:
Some components remain at version 4.13.13 despite the upgrade attempt. Specifically, the dns, machine-config, and network operators are still at version 4.13.13:
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
baremetal 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
cloud-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
cloud-credential 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
cluster-autoscaler 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
config-operator 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
console 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
control-plane-machine-set 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
csi-snapshot-controller 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
dns 4.13.13 True False False 13h
etcd 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
image-registry 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
ingress 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
insights 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
kube-apiserver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
kube-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
kube-scheduler 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
kube-storage-version-migrator 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
machine-api 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
machine-approver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
machine-config 4.13.13 True False False 13h
marketplace 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
monitoring 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
network 4.13.13 True False True 13h Not applying unsafe configuration change: invalid configuration: [cannot change mtu for the Pods Network]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
node-tuning 4.14.0-0.nightly-2023-09-15-233408 True False False 12h
openshift-apiserver 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
openshift-controller-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
openshift-samples 4.14.0-0.nightly-2023-09-15-233408 True False False 12h
operator-lifecycle-manager 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
operator-lifecycle-manager-catalog 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
operator-lifecycle-manager-packageserver 4.14.0-0.nightly-2023-09-15-233408 True False False 12h
service-ca 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
storage 4.14.0-0.nightly-2023-09-15-233408 True False False 13h
Description of problem:
When using a route to expose the API server endpoint in a HostedCluster, the .status.controlPlaneEndpoint.port is reported as 6443 (the internal port) instead of 443, which is the port that is externally exposed via the route.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a HostedCluster with a custom DNS name using route as the strategy
2. Inspect .status.controlPlaneEndpoint
Actual results:
It has 6443 as the port
Expected results:
It has 443 as the port
Additional info:
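The expected result can be expressed as a small rule: report the port that clients actually reach, which depends on how the API server is published. The sketch below is an illustration only. The `control_plane_endpoint_port` helper and the strategy strings are assumptions, not HyperShift's implementation.

```python
def control_plane_endpoint_port(publishing_strategy: str, internal_port: int = 6443) -> int:
    """Return the externally reachable API server port for a publishing strategy."""
    if publishing_strategy == "Route":
        # A route is served by the ingress router over HTTPS, so clients
        # connect on 443 regardless of the internal port.
        return 443
    return internal_port


print(control_plane_endpoint_port("Route"))         # 443
print(control_plane_endpoint_port("LoadBalancer"))  # 6443
```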
Please review the following PR: https://github.com/openshift/service-ca-operator/pull/226
Please review the following PR: https://github.com/openshift/telemeter/pull/496
Description of problem:
The installer does not precheck whether the node architecture and the VM type are consistent on AWS and GCP; the precheck works on Azure.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-multi-2023-12-06-195439
How reproducible:
Always
Steps to Reproduce:
1. In install-config, set the compute architecture field to arm64 but choose an amd64 instance type as the VM type.
2. Create the cluster.
3. Check the installation.
Actual results:
Azure prechecks whether the architecture is consistent with the instance type when creating manifests, for example:
12-07 11:18:24.452 [INFO] Generating manifests files.....
12-07 11:18:24.452 level=info msg=Credentials loaded from file "/home/jenkins/ws/workspace/ocp-common/Flexy-install/flexy/workdir/azurecreds20231207-285-jd7gpj"
12-07 11:18:56.474 level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: controlPlane.platform.azure.type: Invalid value: "Standard_D4ps_v5": instance type architecture 'Arm64' does not match install config architecture amd64
AWS and GCP have no such precheck, so the installation fails partway through, after many resources have already been created. This case is more likely to happen in a multi-arch cluster.
Expected results:
The installer should precheck the architecture against the VM type, especially on platforms that support heterogeneous clusters (AWS, GCP, Azure).
Additional info:
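A precheck of the kind requested could look roughly like the sketch below. It is a simplified illustration, not installer code: the lookup table, the `validate_instance_arch` helper, and the field path are hypothetical, and a real check would ask the cloud provider for each instance type's architecture. The error wording is modelled on the Azure message quoted above.

```python
from typing import List, Mapping

# Hypothetical lookup table for illustration; a real precheck would query
# the cloud provider for the architecture of each instance type.
INSTANCE_TYPE_ARCH = {
    "m6i.xlarge": "amd64",  # AWS, x86_64
    "m6g.xlarge": "arm64",  # AWS, Graviton
}


def validate_instance_arch(
    field_path: str,
    instance_type: str,
    config_arch: str,
    lookup: Mapping[str, str] = INSTANCE_TYPE_ARCH,
) -> List[str]:
    """Return a list of error strings; an empty list means the pairing is consistent."""
    actual = lookup.get(instance_type)
    if actual is None:
        return [f'{field_path}: Invalid value: "{instance_type}": unknown instance type']
    if actual != config_arch:
        return [
            f'{field_path}: Invalid value: "{instance_type}": instance type architecture '
            f"'{actual}' does not match install config architecture {config_arch}"
        ]
    return []


errors = validate_instance_arch("compute[0].platform.aws.type", "m6i.xlarge", "arm64")
for message in errors:
    print(message)
```

Running the check while the install config is being loaded means a mismatch is reported before any cloud resources are created.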
Description of problem:
IPI on IBM Cloud does not currently support the new eu-es region
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create install-config.yaml for IBM Cloud, per docs, using the eu-es region
2. Create the manifests (or cluster) using IPI
Actual results:
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.ibmcloud.region: Unsupported value: "eu-es": supported values: "us-south", "us-east", "jp-tok", "jp-osa", "au-syd", "ca-tor", "eu-gb", "eu-de", "br-sao"
Expected results:
Successful IBM Cloud OCP cluster in eu-es
Additional info:
IBM Cloud has started testing a potential fix in eu-es to confirm that the supported cluster types (Public, Private, BYON) all work properly in that region.
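The error in the actual results comes from a fixed allow-list of regions. The sketch below reproduces that style of validation with the list taken from the error message and shows that adding "eu-es" is enough for the check to pass. The `validate_region` helper is hypothetical, and whether other changes are also needed for eu-es is not covered by this report.

```python
from typing import Optional, Sequence

# Region list copied from the error message in the report.
SUPPORTED_REGIONS = (
    "us-south", "us-east", "jp-tok", "jp-osa", "au-syd",
    "ca-tor", "eu-gb", "eu-de", "br-sao",
)


def validate_region(region: str, supported: Sequence[str] = SUPPORTED_REGIONS) -> Optional[str]:
    """Return an error string for an unsupported region, or None if it is allowed."""
    if region in supported:
        return None
    quoted = ", ".join(f'"{r}"' for r in supported)
    return f'platform.ibmcloud.region: Unsupported value: "{region}": supported values: {quoted}'


print(validate_region("eu-es") is not None)                         # True
print(validate_region("eu-es", SUPPORTED_REGIONS + ("eu-es",)))     # None
```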
Description of the problem:
We have a validation on vSphere that ensures the disk UUID property is set. However, the agent reports a fake disk in appliance mode, with the "hasUUID" property always set to false.
How reproducible:
100%
Steps to reproduce:
1. Try to install on vSphere
Actual results:
The UUID validation always fails
Expected results:
The UUID validation passes if the UUID property is set on the VM
Description of problem:
In a bare-metal multinode OCP cluster, a node ends up in NotReady state. A couple of services have failed on the node:
● cpuset-configure.service loaded failed failed Move services to reserved cpuset
● on-prem-resolv-prepender.service loaded failed failed Populates resolv.conf according to on-prem IPI needs
journalctl --boot --no-pager -u cpuset-configure.service
Sep 18 16:57:37 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Starting Move services to reserved cpuset...
Sep 18 16:57:37 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com cpuset-configure.sh[3014]: /usr/local/bin/cpuset-configure.sh: line 17: /sys/fs/cgroup/cpuset/cpuset.sched_load_balance: Read-only file system
Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: cpuset-configure.service: Main process exited, code=exited, status=1/FAILURE
Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: cpuset-configure.service: Failed with result 'exit-code'.
Sep 18 16:57:38 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Move services to reserved cpuset.
Sep 18 16:57:52 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Failed to start Populates resolv.conf according to on-prem IPI needs.
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: Starting Populates resolv.conf according to on-prem IPI needs...
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4852]: nameserver 10.47.242.10
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4851]: NM resolv-prepender: Starting download of baremetal runtime cfg image
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Trying to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:23012b3380ffce706aa8f204cdc26745d8a69b0218150ec3bcb495202694fdab...
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Getting image source signatures
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:916ead524b9e54b9d5534b65534253c02ce66f1d784e683389aa3c4cb4d12389
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:d8190195889efb5333eeec18af9b6c82313edd4db62989bd3a357caca4f13f0e
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:c71d2589fba7989ecd29ea120fe7add01fab70126fc653a863d5844e35ee5403
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:97da74cc6d8fa5d1634eb1760fd1da5c6048619c264c23e62d75f3bf6b8ef5c4
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying blob sha256:d4dc6e74b6ce09e24dc284cc1967451f3dda2d485bc92fc95d24d91f939e4849
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Copying config sha256:ba2c86ef11c4e341cd0870b6d5b7ad39aa39724389d9d2dfead4ea3d75582071
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Writing manifest to image destination
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: Storing signatures
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4854]: ba2c86ef11c4e341cd0870b6d5b7ad39aa39724389d9d2dfead4ea3d75582071
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4851]: NM resolv-prepender: Download of baremetal runtime cfg image completed
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4863]: Your kernel does not support pids limit capabilities or the cgroup is not mounted. PIDs limit discarded.
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com resolv-prepender.sh[4863]: Error: OCI runtime error: runc: runc create failed: mountpoint for devices not found
Sep 18 16:57:53 openshift-worker-3.ecore.lab.eng.tlv2.redhat.com systemd[1]: on-prem-resolv-prepender.service: Main process exited, code=exited, status=127/n/a
When checking the cgroup config:
oc describe node.config
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
              release.openshift.io/create-only: true
API Version:  config.openshift.io/v1
Kind:         Node
Metadata:
  Creation Timestamp:  2023-09-18T15:27:44Z
  Generation:          3
  Owner References:
    API Version:     config.openshift.io/v1
    Kind:            ClusterVersion
    Name:            version
    UID:             c62da215-6526-4306-8fc6-035612c8605e
  Resource Version:  91518
  UID:               cf2189ba-cd69-45e9-868c-7c2589decb25
Spec:
  Cgroup Mode:  v1
Events:         <none>
Version-Release number of selected component (if applicable):
4.14.0-rc.1
How reproducible:
so far 100%
Steps to Reproduce:
1. Deploy a bare-metal multinode cluster with the GitOps-ZTP workflow
Actual results:
While all policies report Compliant state, some configs are still being applied:
oc get mcp
NAME       CONFIG                                              UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
ht100gb    rendered-ht100gb-572f5aef443a21b21a8c5cfe816708e2    False     True       False      2              0                   0                     0                      77m
master     rendered-master-3c44ec28c389693028ad2cc6b74741ca     True      False      False      3              3                   3                     0                      103m
standard   rendered-standard-1942568110455a377b735e15f18c7ba8   True      False      False      2              2                   2                     0                      77m
worker     rendered-worker-033d4f0a2568efce241d02a2c54ab88e     True      False      False      0              0                   0                     0                      103m
Expected results:
All nodes are in Ready state
Additional info:
Please review the following PR: https://github.com/openshift/ibm-vpc-block-csi-driver-operator/pull/78
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Seeing CI jobs with
> level=error msg=ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
search shows 65 hits in the last 7 days
Please review the following PR: https://github.com/openshift/oc/pull/1542
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26765. The following is the description of the original issue:
—
Description of problem:
The SAST scans keep reporting false positives from test and vendor files. This bug is just a placeholder to allow us to backport the change that ignores those files.
On Managed OpenShift (OSD, ROSA), the on-cluster console should have its update buttons greyed out (disabled) so that customers do not hit the error caused by webhooks blocking updates, since OSD and ROSA updates must go through the OCM UI or the ROSA CLI.
Because managed services govern when specific update versions are allowed, this change would support that without letting the user encounter an unnecessary error.
Description of problem:
When we merged https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/229, it changed the way failure domains were injected for Azure so that additional fields could be accounted for. However, the CPMS failure domains have Azure zones as a string (which they should be) and the machine v1beta1 spec has them as a string pointer. This means that the CPMS now detects a difference between a nil zone and an empty string, even though every other piece of code in OpenShift treats them the same. We should update the machine v1beta1 type to remove the pointer. This will be a no-op in terms of the data stored in etcd since the type is unstructured anyway. It will then require updates to the MAPZ, CPMS, MAO and installer repositories to update their generation.
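The mismatch can be illustrated with a minimal sketch. It is written in Python rather than Go, so the string pointer is modelled as an `Optional[str]` where `None` stands in for a nil pointer. The names `normalize_zone` and `zones_match` are hypothetical and are not taken from the CPMS code base.

```python
from typing import Optional


def normalize_zone(zone: Optional[str]) -> str:
    """Collapse an unset zone (None) and an empty-string zone into one value."""
    return zone or ""


def zones_match(cpms_zone: str, machine_zone: Optional[str]) -> bool:
    """Compare a CPMS failure-domain zone with a machine spec zone.

    Hypothetical helper: both sides are normalized first, so an unset
    zone and an empty zone are treated as the same thing.
    """
    return normalize_zone(cpms_zone) == normalize_zone(machine_zone)


# A naive comparison reports a difference for a region without zones.
print("naive:", "" == None)
# The normalized comparison does not.
print("normalized:", zones_match("", None))
```

Removing the pointer from the type, as proposed above, makes this kind of normalization unnecessary because only one representation of "no zone" remains.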
Version-Release number of selected component (if applicable):
4.14 nightlies from the merge of 229 onwards
How reproducible:
This only affects regions in Azure where there are no zones. In CI it currently affects about 20% of events.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-29103. The following is the description of the original issue:
—
Description of problem:
The HCP CSR flow allows any CN in the incoming CSR.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Using the CSR flow, any name you put in the CN of the CSR becomes your username against the Kubernetes API server. Check your username using the SelfSubjectReview API (kubectl auth whoami).
Steps to Reproduce:
1. Create a CSR with CN=whatever
2. Once the CSR is signed, create a kubeconfig
3. Using that kubeconfig, kubectl auth whoami shows whatever CN was requested
Actual results:
any CN in CSR is the username against the cluster
Expected results:
we should only allow CNs with some known prefix (system:customer-break-glass:...)
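A minimal sketch of the expected check, in Python for illustration only. The prefix string is the one quoted above with the trailing "..." dropped; the function name `is_allowed_common_name` and the rule that something must follow the prefix are assumptions, not the actual HCP implementation.

```python
# Prefix quoted in the expected results; everything else here is hypothetical.
ALLOWED_CN_PREFIX = "system:customer-break-glass:"


def is_allowed_common_name(cn: str, prefix: str = ALLOWED_CN_PREFIX) -> bool:
    """Accept a CN only if it starts with the prefix and names something after it."""
    return cn.startswith(prefix) and len(cn) > len(prefix)


for candidate in ("whatever", "system:admin", "system:customer-break-glass:alice"):
    print(candidate, "->", is_allowed_common_name(candidate))
```

A signer applying a check like this would refuse the CSR before issuing a certificate, so an arbitrary CN never becomes a username.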
Additional info:
Please review the following PR: https://github.com/openshift/csi-external-provisioner/pull/69
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Observe - the Alerting, Metrics, and Targets pages do not load as expected; a blank page is shown
4.15.0-0.nightly-2023-12-07-041003
Always
1. Navigate directly to the Observe -> Alerting, Metrics, and Targets pages
Blank page; no data is loaded
The pages load normally
Failed to load resource: the server responded with a status of 404 (Not Found) /api/accounts_mgmt/v1/subscriptions?page=1&search=external_cluster_id%3D%2715ace915-53d3-4455-b7e3-b7a5a4796b5c%27:1 Failed to load resource: the server responded with a status of 403 (Forbidden) main-chunk-bb9ed989a7f7c65da39a.min.js:1 API call to get support level has failed r: Access denied due to cluster policy. at https://console-openshift-console.apps.ci-ln-9fl1l5t-76ef8.origin-ci-int-aws.dev.rhcloud.com/static/main-chunk-bb9ed989a7f7c65da39a.min.js:1:95279 (anonymous) @ main-chunk-bb9ed989a7f7c65da39a.min.js:1 /api/kubernetes/apis/operators.coreos.com/v1alpha1/namespaces/#ALL_NS#/clusterserviceversions?:1 Failed to load resource: the server responded with a status of 404 (Not Found) vendor-patternfly-5~main-chunk-95cb256d9fa7738d2c46.min.js:1 Modal: When using hasNoBodyWrapper or setting a custom header, ensure you assign an accessible name to the the modal container with aria-label or aria-labelledby.
Description of problem:
When deploying with a service ID, the installer is unable to query resource groups.
Version-Release number of selected component (if applicable):
4.13-4.16
How reproducible:
Easily
Steps to Reproduce:
1. Create a service ID with seemingly enough permissions to do an IPI install
2. Deploy to Power VS with IPI
3. The deployment fails
Actual results:
Deploying a cluster with a service ID fails
Expected results:
Cluster creation should succeed
Additional info:
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3918
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When accessing the URL https://api.test.lab.domain.com:6443/.well-known/openid-configuration, a jwks_uri endpoint containing an api-int URL is returned. We expect this endpoint to be on api instead of api-int.
Version-Release number of selected component (if applicable):
4.11
How reproducible:
100%
Steps to Reproduce:
1. From a web browser, access https://api.test.lab.domain.com:6443/.well-known/openid-configuration
2. From the CLI, try curl -kvv https://api.test.lab.domain.com:6443/.well-known/openid-configuration
3. The output is as below. The jwks_uri returned points to api-int, but I think it should be api
~~~~~
{"issuer":"https://kubernetes.default.svc","jwks_uri":"https://api-int.test.lab.domain.com:6443/openid/v1/jwks","response_types_supported":["id_token"],"subject_types_supported":["public"],"id_token_signing_alg_values_supported":["RS256"]}
~~~~~
Actual results:
"jwks_uri":"https://api-int.test.lab.domain.com:6443/openid/v1/jwks"
Expected results:
"jwks_uri":"https://api.test.lab.domain.com:6443/openid/v1/jwks"
Additional info:
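A small sketch for checking which host the discovery document advertises, using only the Python standard library. The JSON is the response quoted in the steps above; the helper names `jwks_hostname` and `jwks_uses_host` are made up for this illustration.

```python
import json
from urllib.parse import urlparse

# Discovery document exactly as quoted in the reproduction steps.
DISCOVERY_DOC = (
    '{"issuer":"https://kubernetes.default.svc",'
    '"jwks_uri":"https://api-int.test.lab.domain.com:6443/openid/v1/jwks",'
    '"response_types_supported":["id_token"],'
    '"subject_types_supported":["public"],'
    '"id_token_signing_alg_values_supported":["RS256"]}'
)


def jwks_hostname(discovery_json: str) -> str:
    """Return the hostname part of jwks_uri from an OIDC discovery document."""
    doc = json.loads(discovery_json)
    return urlparse(doc["jwks_uri"]).hostname or ""


def jwks_uses_host(discovery_json: str, expected_host: str) -> bool:
    """Report whether jwks_uri points at the expected host."""
    return jwks_hostname(discovery_json) == expected_host


print(jwks_hostname(DISCOVERY_DOC))
print(jwks_uses_host(DISCOVERY_DOC, "api.test.lab.domain.com"))
```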
Description of problem:
Master-only installations, with workers set to 0 replicas, should be supported in UPI. At the moment, the ingress rules that are enabled on workers are not also enabled on masters.
Context: https://bugzilla.redhat.com/show_bug.cgi?id=1955544
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
`oc adm upgrade` silently ignores an incorrect subcommand instead of reporting an error.
This is due to the `default` case in `run()`, which catches every incorrect subcommand and runs the default behavior instead.
Version-Release number of selected component (if applicable): 4.10 and current
How reproducible:
use any incorrect subcommand with `oc adm upgrade`.
example: `oc adm upgrade incorrect-subcommand`
Steps to Reproduce:
1. run `oc adm upgrade incorrect-subcommand`
Actual results:
oc prints the cluster upgrade status
Expected results:
oc should error out saying incorrect subcommand
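The expected behavior can be sketched as a dispatcher that runs the default action only when no subcommand is given. This is a Python illustration of the idea, not the `oc` source (which is Go); `dispatch`, `show_status` and the subcommand name `example-subcommand` are all hypothetical.

```python
from typing import Callable, Dict, Sequence


def show_status() -> str:
    """Stand-in for the default behavior (printing the upgrade status)."""
    return "status"


def dispatch(args: Sequence[str], handlers: Dict[str, Callable[[], str]]) -> str:
    """Run the default action only when no subcommand was given."""
    if not args:
        return show_status()
    name = args[0]
    handler = handlers.get(name)
    if handler is None:
        # Reject unknown input instead of falling through to the default.
        raise ValueError(f"unknown subcommand: {name}")
    return handler()


# "example-subcommand" is a made-up name used only for this illustration.
handlers = {"example-subcommand": lambda: "ran example-subcommand"}
print(dispatch([], handlers))
print(dispatch(["example-subcommand"], handlers))
try:
    dispatch(["incorrect-subcommand"], handlers)
except ValueError as err:
    print("error:", err)
```

The key difference from the reported behavior is that the fallback branch is keyed on "no arguments" rather than on "anything not recognized".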
Description of problem:
If a cluster is installed using a proxy and the username used for connecting to the proxy contains the characters "%40" (the encoding of "@", used when a domain is provided), the installation fails. The failure occurs because systemd ignores the proxy variables set in the file "/etc/systemd/system.conf.d/10-default-env.conf" on the bootstrap node. This issue seems to have been fixed already in MCO (BZ 1882674, fixed in RHOCP 4.7), but it appears to affect the bootstrap process in 4.13 and 4.14, preventing the installation from starting at all.
Version-Release number of selected component (if applicable):
4.14, 4.13
How reproducible:
100% always
Steps to Reproduce:
1. Create an install-config.yaml file with "%40" in the middle of the username used for the proxy.
2. Start the cluster installation.
3. Bootstrap fails because the proxy variables are not used.
Actual results:
Installation fails because systemd fails to load the proxy variables if "%" is present in the username.
Expected results:
Installation to succeed using a username with "%40" for the proxy.
Additional info:
File "/etc/systemd/system.conf.d/10-default-env.conf" for the bootstrap should be generated in a way accepted by systemd.
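One way such a file could be generated is sketched below, assuming the root cause is that systemd treats "%" as the start of a specifier and that doubling it ("%%") yields a literal percent sign. The function names, the proxy URL and the exact file layout are illustrative assumptions, not the installer's actual template.

```python
from typing import Dict


def escape_for_systemd(value: str) -> str:
    """Double every '%' so systemd reads it as a literal percent sign (assumed fix)."""
    return value.replace("%", "%%")


def render_default_environment(env: Dict[str, str]) -> str:
    """Render a [Manager] drop-in that sets DefaultEnvironment= for the given variables."""
    pairs = " ".join(
        '"{}={}"'.format(name, escape_for_systemd(value)) for name, value in env.items()
    )
    return "[Manager]\nDefaultEnvironment=" + pairs + "\n"


# Hypothetical proxy URL with an encoded "@" in the username.
proxy = "http://user%40example.com:secret@proxy.example.com:3128"
print(render_default_environment({"HTTP_PROXY": proxy, "HTTPS_PROXY": proxy}))
```

Escaping at render time keeps the value in install-config.yaml unchanged, so the same proxy URL can still be passed unmodified to components that do not go through systemd.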
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/35
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Build02, a years-old cluster currently running 4.15.0-ec.2 with TechPreviewNoUpgrade, has been Available=False for days:
$ oc get -o json clusteroperator monitoring | jq '.status.conditions[] | select(.type == "Available")'
{
  "lastTransitionTime": "2024-01-14T04:09:52Z",
  "message": "UpdatingMetricsServer: reconciling MetricsServer Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/metrics-server: context deadline exceeded",
  "reason": "UpdatingMetricsServerFailed",
  "status": "False",
  "type": "Available"
}
Both pods had been having CA trust issues. We deleted one pod, and its replacement is happy:
$ oc -n openshift-monitoring get -l app.kubernetes.io/component=metrics-server pods
NAME                             READY   STATUS    RESTARTS   AGE
metrics-server-9cc8bfd56-dd5tx   1/1     Running   0          136m
metrics-server-9cc8bfd56-k2lpv   0/1     Running   0          36d
The young, happy pod has occasional node-removed noise, which is expected in this cluster with high levels of compute-node autoscaling:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-dd5tx
E0117 17:16:13.492646 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:28.611052 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": dial tcp 10.0.32.33:10250: connect: connection refused" node="build0-gstfj-ci-builds-worker-b-srjk5"
E0117 17:16:56.898453 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.33:10250/metrics/resource\": context deadline exceeded" node="build0-gstfj-ci-builds-worker-b-srjk5"
While the old, sad pod is complaining about unknown authorities:
$ oc -n openshift-monitoring logs --tail 3 metrics-server-9cc8bfd56-k2lpv
E0117 17:19:09.612161 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.0.3:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-m-2.c.openshift-ci-build-farm.internal"
E0117 17:19:09.620872 1 scraper.go:140] "Failed to scrape node" err="Get \"https://10.0.32.90:10250/metrics/resource\": tls: failed to verify certificate: x509: certificate signed by unknown authority" node="build0-gstfj-ci-prowjobs-worker-b-cg7qd"
I0117 17:19:14.538837 1 server.go:187] "Failed probe" probe="metric-storage-ready" err="no metrics to serve"
More details in the Additional details section, but the timeline seems to have been something like:
So addressing the metrics-server /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt change detection should resolve this use-case. And triggering a container or pod restart would be an aggressive-but-sufficient mechanism, although loading the new data without rolling the process would be less invasive.
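The change-detection idea can be sketched as a digest comparison on the mounted bundle. This is a generic Python illustration, not how metrics-server or the monitoring operator is implemented; `bundle_digest` and `BundleWatcher` are hypothetical names, and what to do on a change (reload the trust store, or exit so the container restarts) is left to the caller.

```python
import hashlib
from pathlib import Path


def bundle_digest(path: str) -> str:
    """Return the SHA-256 hex digest of the CA bundle's current contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


class BundleWatcher:
    """Remember the last seen digest and report when the file content changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.last_digest = bundle_digest(path)

    def changed(self) -> bool:
        """Return True once per content change, then False until the next one."""
        current = bundle_digest(self.path)
        if current != self.last_digest:
            self.last_digest = current
            return True
        return False
```

A process could call `changed()` on a timer against /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt and react when it returns True; hashing the content rather than checking modification times avoids depending on how the mount is updated.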
4.15.0-ec.3, which has fast CA rotation, see discussion in API-1687.
Unclear.
Unclear.
metrics-server pods having trouble with CA trust when attempting to scrape nodes.
metrics-server pods successfully trusting kubelets when scraping nodes.
The monitoring operator sets up the metrics server with --kubelet-certificate-authority=/etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt, which is the "Path to the CA to use to validate the Kubelet's serving certificates" and is mounted from the kubelet-serving-ca-bundle ConfigMap. But that mount point only contains openshift-kube-controller-manager-operator_csr-signer-signer@... CAs:
$ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- cat /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-gtctn ... Removing debug pod ... Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 137730753918272:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE
While actual kubelets seem to be using certs signed by kube-csr-signer_@1704338196 (which is one of the Subjects in /etc/tls/kubelet-serving-ca-bundle/ca-bundle.crt):
$ oc get -o wide -l node-role.kubernetes.io/master= nodes NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME build0-gstfj-m-0.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.4 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-1.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.5 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 build0-gstfj-m-2.c.openshift-ci-build-farm.internal Ready master 3y240d v1.28.3+20a5764 10.0.0.3 <none> Red Hat Enterprise Linux CoreOS 415.92.202311271112-0 (Plow) 5.14.0-284.41.1.el9_2.x86_64 cri-o://1.28.2-2.rhaos4.15.gite7be4e1.el9 $ oc --as system:admin -n openshift-monitoring debug pod/metrics-server-9cc8bfd56-k2lpv -- openssl s_client -connect 10.0.0.3:10250 -showcerts </dev/null Starting pod/metrics-server-9cc8bfd56-k2lpv-debug-ksl2k ... 
Can't use SSL_get_servername depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify error:num=21:unable to verify the first certificate verify return:1 depth=0 O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal verify return:1 CONNECTED(00000003) --- Certificate chain 0 s:O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal i:CN = kube-csr-signer_@1704338196 -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- --- Server certificate subject=O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal issuer=CN = kube-csr-signer_@1704338196 --- Acceptable client certificate CA names OU = openshift, CN = admin-kubeconfig-signer CN = 
openshift-kube-controller-manager-operator_csr-signer-signer@1699022534 CN = kube-csr-signer_@1700450189 CN = kube-csr-signer_@1701746196 CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 CN = openshift-kube-apiserver-operator_kube-apiserver-to-kubelet-signer@1691004449 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1702234292 CN = openshift-kube-apiserver-operator_kube-control-plane-signer@1699642292 OU = openshift, CN = kubelet-bootstrap-kubeconfig-signer CN = openshift-kube-apiserver-operator_node-system-admin-signer@1678905372 Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512:RSA+SHA1:ECDSA+SHA1 Shared Requested Signature Algorithms: RSA-PSS+SHA256:ECDSA+SHA256:Ed25519:RSA-PSS+SHA384:RSA-PSS+SHA512:RSA+SHA256:RSA+SHA384:RSA+SHA512:ECDSA+SHA384:ECDSA+SHA512 Peer signing digest: SHA256 Peer signature type: ECDSA Server Temp Key: X25519, 253 bits --- SSL handshake has read 1902 bytes and written 383 bytes Verification error: unable to verify the first certificate --- New, TLSv1.3, Cipher is TLS_AES_128_GCM_SHA256 Server public key is 256 bit Secure Renegotiation IS NOT supported Compression: NONE Expansion: NONE No ALPN negotiated Early data was not sent Verify return code: 21 (unable to verify the first certificate) --- DONE Removing debug pod ... 
$ openssl x509 -noout -text <<EOF 2>/dev/null > -----BEGIN CERTIFICATE----- MIIC5DCCAcygAwIBAgIQAbKVl+GS6s2H20EHAWl4WzANBgkqhkiG9w0BAQsFADAm MSQwIgYDVQQDDBtrdWJlLWNzci1zaWduZXJfQDE3MDQzMzgxOTYwHhcNMjQwMTE3 MDMxNDMwWhcNMjQwMjAzMDMxNjM2WjBhMRUwEwYDVQQKEwxzeXN0ZW06bm9kZXMx SDBGBgNVBAMTP3N5c3RlbTpub2RlOmJ1aWxkMC1nc3Rmai1tLTIuYy5vcGVuc2hp ZnQtY2ktYnVpbGQtZmFybS5pbnRlcm5hbDBZMBMGByqGSM49AgEGCCqGSM49AwEH A0IABFqT+UgohFAxJrGYQUeYsEhNB+ufFo14xYDedKBCeNzMhaC+5/I4UN1e1u2X PH7J4ncmH+M/LXI7v+YfEIG7cH+jgZ0wgZowDgYDVR0PAQH/BAQDAgeAMBMGA1Ud JQQMMAoGCCsGAQUFBwMBMAwGA1UdEwEB/wQCMAAwHwYDVR0jBBgwFoAU394ABuS2 9i0qss9AKk/mQ9lhJ88wRAYDVR0RBD0wO4IzYnVpbGQwLWdzdGZqLW0tMi5jLm9w ZW5zaGlmdC1jaS1idWlsZC1mYXJtLmludGVybmFshwQKAAADMA0GCSqGSIb3DQEB CwUAA4IBAQCiKelqlgK0OHFqDPdIR+RRdjXoCfFDa0JGCG0z60LYJV6Of5EPv0F/ vGZdM/TyGnPT80lnLCh2JGUvneWlzQEZ7LEOgXX8OrAobijiFqDZFlvVwvkwWNON rfucLQWDFLHUf/yY0EfB0ZlM8Sz4XE8PYB6BXYvgmUIXS1qkV9eGWa6RPLsOnkkb q/dTLE/tg8cz24IooDC8lmMt/wCBPgsq9AnORgNdZUdjCdh9DpDWCw0E4csSxlx2 H1qlH5TpTGKS8Ox9JAfdAU05p/mEhY9PEPSMfdvBZep1xazrZyQIN9ckR2+11Syw JlbEJmapdSjIzuuKBakqHkDgoq4XN0KM -----END CERTIFICATE----- > EOF ... Issuer: CN = kube-csr-signer_@1704338196 Validity Not Before: Jan 17 03:14:30 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-m-2.c.openshift-ci-build-farm.internal ...
The monitoring operator populates the openshift-monitoring kubelet-serving-ca-bundle ConfigMap using data from the openshift-config-managed kubelet-serving-ca ConfigMap, and that propagation is working, but does not contain the kube-csr-signer_ CA:
$ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | while openssl x509 -noout -text; do :; done | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 3 14:42:33 2023 GMT Not After : Feb 1 14:42:34 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1701614554 Not Before: Dec 20 03:16:35 2023 GMT Not After : Jan 19 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1703042196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 4 03:16:35 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: CN = kube-csr-signer_@1704338196 Certificate: Issuer: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 Not Before: Jan 2 14:42:34 2024 GMT Not After : Mar 2 14:42:35 2024 GMT Subject: CN = openshift-kube-controller-manager-operator_csr-signer-signer@1704206554 unable to load certificate 140531510617408:error:0909006C:PEM routines:get_name:no start line:../crypto/pem/pem_lib.c:745:Expecting: TRUSTED CERTIFICATE $ oc -n openshift-config-managed get -o json configmap kubelet-serving-ca | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af - $ oc -n openshift-monitoring get -o json configmap kubelet-serving-ca-bundle | jq -r '.data["ca-bundle.crt"]' | sha1sum a32ab44dff8030c548087d70fea599b0d3fab8af -
Flipping over to the kubelet side, nothing in the machine-config operator's template is jumping out at me as a key/cert pair for serving on 10250. The kubelet seems to set up server certs via serverTLSBootstrap: true. But we don't seem to set the beta RotateKubeletServerCertificate, so I'm not clear on how these are supposed to rotate on the kubelet side. But there are CSRs from kubelets requesting serving certs:
$ oc get certificatesigningrequests | grep 'NAME\|kubelet-serving' NAME AGE SIGNERNAME REQUESTOR REQUESTEDDURATION CONDITION csr-8stgd 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xkdw2 <none> Approved,Issued csr-blbjx 9m1s kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-b-5w9dz <none> Approved,Issued csr-ghxh5 64m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-sdwdn <none> Approved,Issued csr-hng85 33m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-longtests-worker-d-7d7h2 <none> Approved,Issued csr-hvqxz 24m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-fp6wb <none> Approved,Issued csr-vc52m 50m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-xlmt6 <none> Approved,Issued csr-vflcm 40m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-djpgq <none> Approved,Issued csr-xfr7d 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-8v4vk <none> Approved,Issued csr-zhzbs 51m kubernetes.io/kubelet-serving system:node:build0-gstfj-ci-builds-worker-b-rqr68 <none> Approved,Issued $ oc get -o json certificatesigningrequests csr-blbjx { "apiVersion": "certificates.k8s.io/v1", "kind": "CertificateSigningRequest", "metadata": { "creationTimestamp": "2024-01-17T19:20:43Z", "generateName": "csr-", "name": "csr-blbjx", "resourceVersion": "4719586144", "uid": "5f12d236-3472-485f-8037-3896f51a809c" }, "spec": { "groups": [ "system:nodes", "system:authenticated" ], "request": 
"LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQlh6Q0NBUVFDQVFBd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6TVQwd093WURWUVFERXpSegplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnpMWGR2Y210bGNpMWlMVFYzCk9XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZ3F4ZHNZWkdmQXovTEpoZVgKd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVSUpUN2pCblV1WEdnZktCTQpNRW9HQ1NxR1NJYjNEUUVKRGpFOU1Ec3dPUVlEVlIwUkJESXdNSUlvWW5WcGJHUXdMV2R6ZEdacUxXTnBMV3h2CmJtZDBaWE4wY3kxM2IzSnJaWEl0WWkwMWR6bGtlb2NFQ2dBZ0F6QUtCZ2dxaGtqT1BRUURBZ05KQURCR0FpRUEKMHlRVzZQOGtkeWw5ZEEzM3ppQTJjYXVJdlhidTVhczNXcUZLYWN2bi9NSUNJUURycEQyVEtScHJOU1I5dExKTQpjZ0ZpajN1dVNieVJBcEJ5NEE1QldEZm02UT09Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=", "signerName": "kubernetes.io/kubelet-serving", "usages": [ "digital signature", "server auth" ], "username": "system:node:build0-gstfj-ci-longtests-worker-b-5w9dz" }, "status": { "certificate": "LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUN6ekNDQWJlZ0F3SUJBZ0lSQUlGZ1NUd0ovVUJLaE1hWlE4V01KcEl3RFFZSktvWklodmNOQVFFTEJRQXcKSmpFa01DSUdBMVVFQXd3YmEzVmlaUzFqYzNJdGMybG5ibVZ5WDBBeE56QTBNek00TVRrMk1CNFhEVEkwTURFeApOekU1TVRVME0xb1hEVEkwTURJd016QXpNVFl6Tmxvd1ZqRVZNQk1HQTFVRUNoTU1jM2x6ZEdWdE9tNXZaR1Z6Ck1UMHdPd1lEVlFRREV6UnplWE4wWlcwNmJtOWtaVHBpZFdsc1pEQXRaM04wWm1vdFkya3RiRzl1WjNSbGMzUnoKTFhkdmNtdGxjaTFpTFRWM09XUjZNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBEQVFjRFFnQUV5Y0dhSDMvZwpxeGRzWVpHZkF6L0xKaGVYd1o0Z1VRbjB6TlZUenJncHpvd1VPOGR6NTN4UUZTOTRibm40NldlZFg3Q2xidUpVCklKVDdqQm5VdVhHZ2ZLT0JrakNCanpBT0JnTlZIUThCQWY4RUJBTUNCNEF3RXdZRFZSMGxCQXd3Q2dZSUt3WUIKQlFVSEF3RXdEQVlEVlIwVEFRSC9CQUl3QURBZkJnTlZIU01FR0RBV2dCVGYzZ0FHNUxiMkxTcXl6MEFxVCtaRAoyV0VuenpBNUJnTlZIUkVFTWpBd2dpaGlkV2xzWkRBdFozTjBabW90WTJrdGJHOXVaM1JsYzNSekxYZHZjbXRsCmNpMWlMVFYzT1dSNmh3UUtBQ0FETUEwR0NTcUdTSWIzRFFFQkN3VUFBNElCQVFBRE5ad0pMdkp4WWNta2RHV08KUm5ocC9rc3V6akJHQnVHbC9VTmF0RjZScml3eW9mdmpVNW5Kb0RFbGlLeHlDQ2wyL1d5VXl5a2hMSElBK1drOQoxZjRWajIrYmZFd0IwaGpuTndxQThudFFabS90TDhwalZ5ZzFXM0VwR2FvRjNsZzRybDA1c
XBwcjVuM2l4WURJClFFY2ZuNmhQUnlKN056dlFCS0RwQ09lbU8yTFllcGhqbWZGY2h5VGRZVGU0aE9IOW9TWTNMdDdwQURIM2kzYzYKK3hpMDhhV09LZmhvT3IybTVBSFBVN0FkTjhpVUV0M0dsYzI0SGRTLzlLT05tT2E5RDBSSk9DMC8zWk5sKzcvNAoyZDlZbnYwaTZNaWI3OGxhNk5scFB0L2hmOWo5TlNnMDN4OFZYRVFtV21zN29xY1FWTHMxRHMvWVJ4VERqZFphCnEwMnIKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=", "conditions": [ { "lastTransitionTime": "2024-01-17T19:20:43Z", "lastUpdateTime": "2024-01-17T19:20:43Z", "message": "This CSR was approved by the Node CSR Approver (cluster-machine-approver)", "reason": "NodeCSRApprove", "status": "True", "type": "Approved" } ] } } $ oc get -o json certificatesigningrequests csr-blbjx | jq -r '.status.certificate | @base64d' | openssl x509 -noout -text | grep '^Certificate:\|Issuer\|Subject:\|Not ' Certificate: Issuer: CN = kube-csr-signer_@1704338196 Not Before: Jan 17 19:15:43 2024 GMT Not After : Feb 3 03:16:36 2024 GMT Subject: O = system:nodes, CN = system:node:build0-gstfj-ci-longtests-worker-b-5w9dz
So that's approved by cluster-machine-approver, but signerName: kubernetes.io/kubelet-serving is an upstream Kubernetes component documented here, and the signer is implemented by kube-controller-manager.
To facilitate testing manifest generation, extract OpenStack API calls from the function body.
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There is currently no way to interrupt a stuck HostedCluster upgrade because we don't allow another upgrade until the current upgrade is finished. At the very least we should allow overriding the upgrade with the ForceUpgradeTo annotation.
The function name doesn't honour the behaviour.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Install hosted cluster
2. Start upgrade to a bad release that will not complete
3. Attempt to override the current upgrade with a different release via annotation.
Actual results:
The override upgrade is not applied because the initial upgrade is not completed.
Expected results:
The override upgrade starts and completes successfully.
Additional info:
https://github.com/openshift/hypershift/blob/572a75655f0d86d6e2139f27e14eb1b168a5842b/hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go#L4123-L4135
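The expected behaviour can be sketched as a small decision function. This is only an illustration of the rule described above, not the HyperShift implementation: the function name, its arguments, and the exact annotation key are assumptions made for the example.

```python
# Assumed annotation key, used here for illustration only.
FORCE_UPGRADE_TO_ANNOTATION = "hypershift.openshift.io/force-upgrade-to"


def upgrade_allowed(requested_image, rollout_in_progress, annotations):
    """Return True when a new release image may be applied.

    A new upgrade is accepted when no rollout is in progress, or when the
    force annotation names exactly the requested release image.
    """
    if not rollout_in_progress:
        return True
    return annotations.get(FORCE_UPGRADE_TO_ANNOTATION) == requested_image
```

With this rule, a stuck upgrade no longer blocks a new one as long as the annotation value matches the release being requested.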
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/104
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Because of the pin in the packages list, the ART pipeline is rebuilding packages all the time.
Unfortunately, we need to remove the strong pins and move back to relaxed ones.
Once that's done, we need to merge https://github.com/openshift-eng/ocp-build-data/pull/4097
A cluster with 103 nodes failed to update in the UI on the Networking page
with the dialog error: "The service is down, undergoing maintenance, or experiencing another issue."
And the error in the UI:
"[10] Message Size Too Large: the server has a configurable maximum message size to avoid
unbounded memory allocation and the client attempted to produce a message larger than this maximum"
And in the browser debugger:
PATCH https://api.stage.openshift.com/api/assisted-install/v2/clusters/674c7056-4db9-4ea6-9f1d-f976fc77897e 500 (Internal Server Error)
See attached screenshot
Steps to reproduce:
1. Create cluster, generate minimal ISO image, download to servers
2. Boot 103 nodes with ISO image
3. Wait until all nodes have finished discovering
4. Click Next, Next
5. Set API and Ingress VIP in Networking page
Actual results:
An error dialog is raised: Unable to update cluster
The service is down, undergoing maintenance, or experiencing another issue.
The dialog asks to Refresh, which returns to the Cluster details page.
Expected results:
The cluster should be updated and the installation should be allowed to continue.
This is a clone of issue OCPBUGS-20368. The following is the description of the original issue:
—
Description of problem:
Automate E2E tests of Dynamic OVS Pinning. This bug is created for merging
https://github.com/openshift/cluster-node-tuning-operator/pull/746
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/156
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed preparing ignition data: ignition failed to provision storage: failed to create storage: failed to create bucket: googleapi: Error 409: Your previous request to create the named bucket succeeded and you already own it., conflict
If the user does not specify a rendezvousIP and instead leaves it to the installer to choose one of the configured static IPs, it always picks the lowest IP. If no roles are assigned, this host will become part of the control plane.
If the user assigns the lowest IP to a host to which they also assign a worker role, the install will fail.
It's not clear what will happen if the role is not explicitly set on the host with the lowest IP, but there are already sufficient control plane nodes assigned from among the other hosts. In any event, this wouldn't be good.
We should select a static IP among only the hosts that are eligible to become part of the control plane.
A user can work around this by explicitly specifying the rendezvousIP.
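The proposed selection rule can be sketched as follows. This is a minimal illustration, not the installer's code: the host dictionaries, the role values, and the function name are assumptions, and all addresses are assumed to belong to the same IP family.

```python
import ipaddress


def pick_rendezvous_ip(hosts):
    """Pick the lowest static IP among hosts that may join the control plane.

    `hosts` is a list of dicts with an "ip" key and an optional "role" key.
    Hosts with no role or with the "master" role are treated as eligible;
    hosts explicitly marked as "worker" are skipped.
    """
    eligible = [h for h in hosts if h.get("role") in (None, "", "master")]
    if not eligible:
        raise ValueError("no host is eligible for the control plane")
    # Compare addresses numerically, not as strings.
    return str(min(ipaddress.ip_address(h["ip"]) for h in eligible))
```

For example, with a worker at 192.168.1.2, a master at 192.168.1.10 and an unassigned host at 192.168.1.9, the sketch skips the worker and returns 192.168.1.9.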
This is a clone of issue OCPBUGS-25673. The following is the description of the original issue:
—
Description of problem:
CNV upgrades from v4.14.1 to v4.15.0 (unreleased) are not starting due to out of sync operatorCondition.
We see:
$ oc get csv
NAME                                       DISPLAY                    VERSION   REPLACES                                   PHASE
kubevirt-hyperconverged-operator.v4.14.1   OpenShift Virtualization   4.14.1    kubevirt-hyperconverged-operator.v4.14.0   Replacing
kubevirt-hyperconverged-operator.v4.15.0   OpenShift Virtualization   4.15.0    kubevirt-hyperconverged-operator.v4.14.1   Pending
And on the v4.15.0 CSV:
$ oc get csv kubevirt-hyperconverged-operator.v4.15.0 -o yaml
....
status:
  cleanup: {}
  conditions:
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: requirements not yet checked
    phase: Pending
    reason: RequirementsUnknown
  - lastTransitionTime: "2023-12-19T01:50:48Z"
    lastUpdateTime: "2023-12-19T01:50:48Z"
    message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated'
    phase: Pending
    reason: OperatorConditionNotUpgradeable
  lastTransitionTime: "2023-12-19T01:50:48Z"
  lastUpdateTime: "2023-12-19T01:50:48Z"
  message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated'
  phase: Pending
  reason: OperatorConditionNotUpgradeable
and if we check the pending operator condition (v4.14.1) we see:
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 18
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4116127"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:23Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable
where metadata.generation (18) is not in sync with status.conditions[*].observedGeneration (11).
Even manually editing spec.conditions.lastTransitionTime causes a change in metadata.generation (as expected), but this doesn't trigger any reconciliation in the OLM, so status.conditions[*].observedGeneration remains at 11.
$ oc get operatorcondition kubevirt-hyperconverged-operator.v4.14.1 -o yaml
apiVersion: operators.coreos.com/v2
kind: OperatorCondition
metadata:
  creationTimestamp: "2023-12-16T17:10:17Z"
  generation: 19
  labels:
    operators.coreos.com/kubevirt-hyperconverged.openshift-cnv: ""
  name: kubevirt-hyperconverged-operator.v4.14.1
  namespace: openshift-cnv
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: ClusterServiceVersion
    name: kubevirt-hyperconverged-operator.v4.14.1
    uid: 7db79d4b-e69e-4af8-9335-6269cf004440
  resourceVersion: "4147472"
  uid: 347306c9-865a-42b8-b2c9-69192b0e350a
spec:
  conditions:
  - lastTransitionTime: "2023-12-18T18:47:25Z"
    message: ""
    reason: Upgradeable
    status: "True"
    type: Upgradeable
  deployments:
  - hco-operator
  - hco-webhook
  - hyperconverged-cluster-cli-download
  - cluster-network-addons-operator
  - virt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  serviceAccounts:
  - hyperconverged-cluster-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
  - cluster-network-addons-operator
  - kubevirt-operator
  - ssp-operator
  - cdi-operator
  - hostpath-provisioner-operator
  - mtq-operator
status:
  conditions:
  - lastTransitionTime: "2023-12-18T09:41:06Z"
    message: ""
    observedGeneration: 11
    reason: Upgradeable
    status: "True"
    type: Upgradeable
since its observedGeneration is out of sync, this check:
https://github.com/operator-framework/operator-lifecycle-manager/blob/master/pkg/controller/operators/olm/operatorconditions.go#L44C1-L48
fails and the upgrade never starts.
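The staleness comparison described above can be sketched like this. It is a simplified illustration of the idea, not the OLM source: the function name is hypothetical, and treating a missing Upgradeable condition as "not outdated" is an assumption of the sketch.

```python
def upgradeable_is_outdated(operator_condition):
    """Return True when the Upgradeable status lags behind the object.

    `operator_condition` is the OperatorCondition parsed into a dict. The
    status condition counts as outdated when its observedGeneration is lower
    than metadata.generation, i.e. the controller has not yet observed the
    latest spec change.
    """
    generation = operator_condition["metadata"]["generation"]
    conditions = operator_condition.get("status", {}).get("conditions", [])
    for cond in conditions:
        if cond.get("type") == "Upgradeable":
            return cond.get("observedGeneration", 0) < generation
    # Assumption: with no Upgradeable condition there is nothing to be stale.
    return False
```

Applied to the object shown above (generation 18 or 19, observedGeneration 11), the sketch reports the condition as outdated, which matches the OperatorConditionNotUpgradeable message on the CSV.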
I suspect (I'm only guessing) that it could be a regression introduced with the memory optimization for https://issues.redhat.com/browse/OCPBUGS-17157 .
Version-Release number of selected component (if applicable):
OCP 4.15.0-ec.3
How reproducible:
- Not reproducible (with the same CNV bundles) on OCP v4.14.z.
- Pretty high (but not 100%) on OCP 4.15.0-ec.3
Steps to Reproduce:
1. Try triggering a CNV v4.14.1 -> v4.15.0 upgrade on OCP 4.15.0-ec.3
Actual results:
The OLM is not reacting to changes on spec.conditions on the pending operator condition, so metadata.generation is constantly out of sync with status.conditions[*].observedGeneration and so the CSV is reported as message: 'operator is not upgradeable: the operatorcondition status "Upgradeable"="True" is outdated' phase: Pending reason: OperatorConditionNotUpgradeable
Expected results:
The OLM correctly reconciles the operatorCondition and the upgrade starts.
Additional info:
Not reproducible with exactly the same bundle (origin and target) on OCP v4.14.z
This is a clone of issue OCPBUGS-30124. The following is the description of the original issue:
—
Description of problem:
In https://issues.redhat.com/browse/OCPBUGS-28625?focusedId=24056681&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24056681 , Seth Jennings states "It is not required to set the oauthMetadata to enable external OIDC".
Today I had a chance to try without setting oauthMetadata, and oc login fails with the error:
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Console login can succeed, though.
Note: OCM QE also encounters this when using the ocm CLI to test ROSA HCP external OIDC. Whether the fix belongs in oc, in HCP, or elsewhere (as a tester I'm not sure), it is worth fixing; otherwise oc login is affected.
Version-Release number of selected component (if applicable):
[xxia@2024-03-01 21:03:30 CST my]$ oc version --client
Client Version: 4.16.0-0.ci-2024-03-01-033249
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
[xxia@2024-03-01 21:03:50 CST my]$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-02-29-213249   True        False         8h      Cluster version is 4.16.0-0.ci-2024-02-29-213249
How reproducible:
Always
Steps to Reproduce:
1. Launch a fresh HCP cluster.
2. Log in to https://entra.microsoft.com. Register the application and configure it properly.
3. Prepare variables.
HC_NAME=hypershift-ci-267920
MGMT_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/kubeconfig
HOSTED_KUBECONFIG=/home/xxia/my/env/xxia-hs416-2-267920-4.16/hypershift-ci-267920.kubeconfig
AUDIENCE=7686xxxxxx
ISSUER_URL=https://login.microsoftonline.com/64dcxxxxxxxx/v2.0
CLIENT_ID=7686xxxxxx
CLIENT_SECRET_VALUE="xxxxxxxx"
CLIENT_SECRET_NAME=console-secret
4. Configure HC without oauthMetadata.
[xxia@2024-03-01 20:29:21 CST my]$ oc create secret generic console-secret -n clusters --from-literal=clientSecret=$CLIENT_SECRET_VALUE --kubeconfig $MGMT_KUBECONFIG
[xxia@2024-03-01 20:34:05 CST my]$ oc patch hc $HC_NAME -n clusters --kubeconfig $MGMT_KUBECONFIG --type=merge -p="
spec:
  configuration:
    authentication:
      oauthMetadata:
        name: ''
      oidcProviders:
      - claimMappings:
          groups:
            claim: groups
            prefix: 'oidc-groups-test:'
          username:
            claim: email
            prefixPolicy: Prefix
            prefix:
              prefixString: 'oidc-user-test:'
        issuer:
          audiences:
          - $AUDIENCE
          issuerURL: $ISSUER_URL
        name: microsoft-entra-id
        oidcClients:
        - clientID: $CLIENT_ID
          clientSecret:
            name: $CLIENT_SECRET_NAME
          componentName: console
          componentNamespace: openshift-console
      type: OIDC
"
Wait for the pods to be renewed:
[xxia@2024-03-01 20:52:41 CST my]$ oc get po -n clusters-$HC_NAME --kubeconfig $MGMT_KUBECONFIG --sort-by metadata.creationTimestamp
...
certified-operators-catalog-7ff9cffc8f-z5dlg     1/1   Running   0   5h44m
kube-apiserver-6bd9f7ccbd-kqzm7                  5/5   Running   0   17m
kube-apiserver-6bd9f7ccbd-p2fw7                  5/5   Running   0   15m
kube-apiserver-6bd9f7ccbd-fmsgl                  5/5   Running   0   13m
openshift-apiserver-7ffc9fd764-qgd4z             3/3   Running   0   11m
openshift-apiserver-7ffc9fd764-vh6x9             3/3   Running   0   10m
openshift-apiserver-7ffc9fd764-b7znk             3/3   Running   0   10m
konnectivity-agent-577944765c-qxq75              1/1   Running   0   9m42s
hosted-cluster-config-operator-695c5854c-dlzwh   1/1   Running   0   9m42s
cluster-version-operator-7c99cf68cd-22k84        1/1   Running   0   9m42s
konnectivity-agent-577944765c-kqfpq              1/1   Running   0   9m40s
konnectivity-agent-577944765c-7t5ds              1/1   Running   0   9m37s
5. Check console login and oc login.
$ export KUBECONFIG=$HOSTED_KUBECONFIG
$ curl -ksS $(oc whoami --show-server)/.well-known/oauth-authorization-server
{
  "issuer": "https://:0",
  "authorization_endpoint": "https://:0/oauth/authorize",
  "token_endpoint": "https://:0/oauth/token",
  ...
}
Check console login: it succeeds, and the upper right of the console correctly shows the user name oidc-user-test:xxia@redhat.com.
Check oc login:
$ rm -rf ~/.kube/cache/oc/
$ oc login --exec-plugin=oc-oidc --client-id=$CLIENT_ID --client-secret=$CLIENT_SECRET_VALUE --extra-scopes=email --callback-port=8080
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
error: oidc authenticator error: oidc discovery error: Get "https://:0/.well-known/openid-configuration": dial tcp :0: connect: connection refused
Unable to connect to the server: getting credentials: exec: executable oc failed with exit code 1
Actual results:
Console login succeeds. oc login fails.
Expected results:
oc login should also succeed.
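The failure comes from the discovery document advertising the placeholder issuer "https://:0". A client-side sanity check for such a value might look like the sketch below; the function name is hypothetical and this is not how oc validates the issuer, only an illustration of why the URL cannot be dialed.

```python
from urllib.parse import urlsplit


def issuer_is_usable(issuer):
    """Return True when an issuer URL names a host a client can connect to.

    The placeholder "https://:0" has an empty host and port 0, so it is
    rejected; a normal HTTPS issuer URL is accepted.
    """
    parts = urlsplit(issuer)
    try:
        port = parts.port  # raises ValueError for malformed ports
    except ValueError:
        return False
    return parts.scheme == "https" and bool(parts.hostname) and port != 0
```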
Additional info:
Description of problem:
network-tools -h
error: You must be logged in to the server (Unauthorized)
error: You must be logged in to the server (Unauthorized)
Usage: network-tools [command]
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/623
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The unit tests don't cover the scenario where hosts are provided without any interfaces in agent-config.yaml.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
No unit test
Expected results:
A valid unit test which tests the error message "at least one interface must be defined for each node"
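The shape of such a test can be sketched as follows. The real validation lives in the installer; the `validate_hosts` helper and the host dictionaries here are hypothetical stand-ins, and only the error message is taken from the description above.

```python
ERR_NO_INTERFACES = "at least one interface must be defined for each node"


def validate_hosts(hosts):
    """Return one error string per host that defines no interfaces."""
    errors = []
    for index, host in enumerate(hosts):
        # A missing key and an empty list are both treated as "no interfaces".
        if not host.get("interfaces"):
            errors.append(f"hosts[{index}]: {ERR_NO_INTERFACES}")
    return errors


def test_hosts_without_interfaces_are_rejected():
    hosts = [
        {"hostname": "control-0", "interfaces": []},
        {"hostname": "control-1"},
    ]
    errors = validate_hosts(hosts)
    assert len(errors) == 2
    assert all(ERR_NO_INTERFACES in error for error in errors)
```

The test covers both ways a host can end up without interfaces: an empty list and an omitted field.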
Additional info:
Please review the following PR: https://github.com/openshift/cluster-autoscaler-operator/pull/286
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The script rh-manifest.sh in Openshift/Thanos has stopped working and generates an empty dependency list.
Version-Release number of selected component (if applicable):
How reproducible:
Run script/rh-manifest.sh in Openshift/Thanos and check rh-manifest.txt.
Steps to Reproduce:
1. 2. 3.
Actual results:
The generated rh-manifest.txt is empty.
Expected results:
The generated rh-manifest.txt should list Javascript dependencies.
Additional info:
Description of problem:
Creating a pipelinerun with previous annotations leads to the result not being created. But records are updated with new taskruns.
https://github.com/tektoncd/results/issues/556
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install TektonResults on the cluster
2. Create a Pipeline and start the Pipeline
3. Rerun the PipelineRun
4. Check the records endpoint, e.g. https://tekton-results-api-service-openshift-pipelines.apps.viraj-11-10-2023.devcluster.openshift.com/apis/results.tekton.dev/v1alpha2/parents/viraj/results/-/records. The new PipelineRun is not saved.
Actual results:
The new PipelineRun created by the rerun is not saved in the records.
Expected results:
All PipelineRuns should be saved in the records.
Additional info:
Document to install TektonResults on the cluster https://gist.github.com/vikram-raj/257d672a38eb2159b0368eaed8f8970a
Description of problem:
Excessive permissions in web-console impersonating a user
Version-Release number of selected component (if applicable):
4.10.55
How reproducible:
When impersonating a specific user ('99GU8710') in an OCP 4.10.55 cluster, we are able to see pods and logs in the web console, although that user is unable to access them using the command line.
Steps to Reproduce:
1. Create a user with LDAP (example: new_user)
2. Don't give the user access to check pod logs for openshift-related namespaces (for example, new_user should not be able to see pod logs for openshift-apiserver)
3. Try to impersonate the user (new_user)
4. Try to check openshift-apiserver pod logs through command line (you will be able to see those)
5. Try to check the same logs from the command line for new_user; you won't be able to see them.
Actual results:
`Impersonate the user` feature doesn't give correct validation
Expected results:
We should not be able to see pod logs if user does not have permission
Additional info:
Description of problem:
Revert "force cert rotation every couple days for development" in 4.15.
Below are the steps to verify this bug:
# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-06-25-081133|grep -i cluster-kube-apiserver-operator
cluster-kube-apiserver-operator https://github.com/openshift/cluster-kube-apiserver-operator 7764681777edfa3126981a0a1d390a6060a840a3
# git log --date local --pretty="%h %an %cd - %s" 776468 |grep -i "#1307"
08973b820 openshift-ci[bot] Thu Jun 23 22:40:08 2022 - Merge pull request #1307 from tkashem/revert-cert-rotation
# oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-06-25-081133   True        False         64m     Cluster version is 4.11.0-0.nightly-2022-06-25-081133
$ cat scripts/check_secret_expiry.sh
FILE="$1"
if [ ! -f "$1" ]; then
  echo "must provide \$1" && exit 0
fi
export IFS=$'\n'
for i in `cat "$FILE"`
do
  if `echo "$i" | grep "^#" > /dev/null`; then
    continue
  fi
  NS=`echo $i | cut -d ' ' -f 1`
  SECRET=`echo $i | cut -d ' ' -f 2`
  rm -f tls.crt; oc extract secret/$SECRET -n $NS --confirm > /dev/null
  echo "Check cert dates of $SECRET in project $NS:"
  openssl x509 -noout --dates -in tls.crt; echo
done
$ cat certs.txt
openshift-kube-controller-manager-operator csr-signer-signer
openshift-kube-controller-manager-operator csr-signer
openshift-kube-controller-manager kube-controller-manager-client-cert-key
openshift-kube-apiserver-operator aggregator-client-signer
openshift-kube-apiserver aggregator-client
openshift-kube-apiserver external-loadbalancer-serving-certkey
openshift-kube-apiserver internal-loadbalancer-serving-certkey
openshift-kube-apiserver service-network-serving-certkey
openshift-config-managed kube-controller-manager-client-cert-key
openshift-config-managed kube-scheduler-client-cert-key
openshift-kube-scheduler kube-scheduler-client-cert-key
Checking the certs: they have one-day expiry times, which is as expected.
# ./check_secret_expiry.sh certs.txt
Check cert dates of csr-signer-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:41:38 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT
Check cert dates of csr-signer in project openshift-kube-controller-manager-operator:
notBefore=Jun 27 04:52:21 2022 GMT
notAfter=Jun 28 04:41:38 2022 GMT
Check cert dates of kube-controller-manager-client-cert-key in project openshift-kube-controller-manager:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT
Check cert dates of aggregator-client-signer in project openshift-kube-apiserver-operator:
notBefore=Jun 27 04:41:37 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT
Check cert dates of aggregator-client in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jun 28 04:41:37 2022 GMT
Check cert dates of external-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT
Check cert dates of internal-loadbalancer-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:49 2022 GMT
notAfter=Jul 27 04:52:50 2022 GMT
Check cert dates of service-network-serving-certkey in project openshift-kube-apiserver:
notBefore=Jun 27 04:52:28 2022 GMT
notAfter=Jul 27 04:52:29 2022 GMT
Check cert dates of kube-controller-manager-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:26 2022 GMT
notAfter=Jul 27 04:52:27 2022 GMT
Check cert dates of kube-scheduler-client-cert-key in project openshift-config-managed:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
Check cert dates of kube-scheduler-client-cert-key in project openshift-kube-scheduler:
notBefore=Jun 27 04:52:47 2022 GMT
notAfter=Jul 27 04:52:48 2022 GMT
#
# cat check_secret_expiry_within.sh
#!/usr/bin/env bash
# usage: ./check_secret_expiry_within.sh 1day # or 15min, 2days, 2day, 2month, 1year
WITHIN=${1:-24hours}
echo "Checking validity within $WITHIN ..."
oc get secret --insecure-skip-tls-verify -A -o json | jq -r '.items[] | select(.metadata.annotations."auth.openshift.io/certificate-not-after" | . != null and fromdateiso8601<='$( date --date="+$WITHIN" +%s )') | "\(.metadata.annotations."auth.openshift.io/certificate-not-before") \(.metadata.annotations."auth.openshift.io/certificate-not-after") \(.metadata.namespace)\t\(.metadata.name)"'
# ./check_secret_expiry_within.sh 1day
Checking validity within 1day ...
2022-06-27T04:41:37Z 2022-06-28T04:41:37Z openshift-kube-apiserver-operator aggregator-client-signer
2022-06-27T04:52:26Z 2022-06-28T04:41:37Z openshift-kube-apiserver aggregator-client
2022-06-27T04:52:21Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer
2022-06-27T04:41:38Z 2022-06-28T04:41:38Z openshift-kube-controller-manager-operator csr-signer-signer
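The filter that the jq one-liner applies can be sketched in Python for readability. This is an equivalent written under assumptions, not part of the verification steps: the function name is hypothetical, and the input is assumed to be the `items` list from `oc get secret -A -o json` already parsed into dicts.

```python
from datetime import datetime, timedelta, timezone

# Annotation key taken from the jq filter above.
NOT_AFTER = "auth.openshift.io/certificate-not-after"


def secrets_expiring_within(secrets, window, now=None):
    """Return (namespace, name, not_after) for secrets expiring within `window`.

    Secrets without the not-after annotation are skipped, as in the jq filter.
    """
    now = now or datetime.now(timezone.utc)
    deadline = now + window
    expiring = []
    for secret in secrets:
        meta = secret.get("metadata", {})
        not_after = (meta.get("annotations") or {}).get(NOT_AFTER)
        if not_after is None:
            continue
        expires = datetime.strptime(not_after, "%Y-%m-%dT%H:%M:%SZ")
        expires = expires.replace(tzinfo=timezone.utc)
        if expires <= deadline:
            expiring.append((meta.get("namespace"), meta.get("name"), not_after))
    return expiring
```

Calling it with `window=timedelta(days=1)` corresponds to running the shell script with `1day`.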
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Issue 30 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
On the Helm page, after clicking the README link, margin spacing is missing in all directions.
Screenshot: https://drive.google.com/file/d/1pYFsVxJrB4m2s7pYuw1QeTW3j38A_fRT/view?usp=drive_link
This is a clone of issue OCPBUGS-25699. The following is the description of the original issue:
—
Description of problem:
If GloballyDisableIrqLoadBalancing is disabled in the performance profile, then irqs should be balanced across all cpus minus the cpus that are explicitly removed by crio via the pod annotation irq-load-balancing.crio.io: "disable".
We have found a number of issues with this:
1) The script clear-irqbalance-banned-cpus.sh is setting an empty value for IRQBALANCE_BANNED_CPUS in /etc/sysconfig/irqbalance. If no value is provided, irqbalance will calculate a default. The default will exclude all isolated and nohz_full cpus from the mask, resulting in the irqs being balanced over the reserved cpus only, breaking the user intent. If a guaranteed pod with the irq-load-balancing.crio.io: "disable" annotation gets launched, then irqbalance will heal the system, but if one never does, then all irqs will be affined to the reserved cores. This script needs to set the banned mask to 0's on startup.
2) The more serious issue: the scheduler plugin in tuned will attempt to affine all irqs to the non-isolated cores. Isolated here means non-reserved, not truly isolated cores. This is directly at odds with the user intent. So now we have tuned fighting with crio/irqbalance, both trying to do different things. Scenarios:
- If a pod gets launched with the annotation after tuned has started, at runtime or after a reboot: ok
- On a reboot, if tuned recovers after the guaranteed pod has been launched: broken
- If tuned restarts at runtime for any reason: broken
3) Lastly, the crio restore of the irqbalance mask needs to be removed. Disabling this should be part of the crio conf that is installed by the NTO.
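To make the "banned mask of 0's" in point 1 concrete, the sketch below builds a CPU mask string from a set of banned cpus. It is only an illustration: the function name is hypothetical, and the output format (comma-separated 32-bit hexadecimal words, most significant word first) is an assumption about what IRQBALANCE_BANNED_CPUS accepts.

```python
def banned_cpus_mask(banned_cpus, total_cpus):
    """Build a hex cpumask with one bit set per banned cpu.

    An empty `banned_cpus` yields a mask of all zeros, i.e. no cpu is
    excluded from irq balancing.
    """
    bits = 0
    for cpu in banned_cpus:
        if not 0 <= cpu < total_cpus:
            raise ValueError(f"cpu {cpu} is outside 0..{total_cpus - 1}")
        bits |= 1 << cpu
    groups = (total_cpus + 31) // 32  # number of 32-bit words needed
    words = [(bits >> (32 * i)) & 0xFFFFFFFF for i in range(groups)]
    return ",".join(f"{word:08x}" for word in reversed(words))
```

On a 64-cpu system, an explicit all-zero mask is `00000000,00000000`, which differs from leaving the variable empty and letting irqbalance compute its own default.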
Version-Release number of selected component (if applicable):
4.14 and likely earlier
How reproducible:
See description
Steps to Reproduce:
1.See description 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-28666. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
https://github.com/kubernetes/kubernetes/issues/118916
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. compare memory usage from v1 and v2 and notice differences with the same workloads 2. 3.
Actual results:
they slightly differ because of accounting differences
Expected results:
they should be largely the same
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/42
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If secure boot is currently disabled and the user attempts to enable it via ZTP, the install will not begin the first time ZTP is triggered.
When secure boot is enabled via ZTP, the boot options are configured before the virtual CD is attached, so the first boot goes into the existing HD with secure boot on. The install then gets stuck because boot from CD was never triggered.
Version-Release number of selected component (if applicable):
4.10
How reproducible:
Always
Steps to Reproduce:
1. Secure boot is currently disabled in bios
2. Attempt to deploy a cluster with secure boot enabled via ZTP
3.
Actual results:
Expected results:
Additional info:
Secure boot config used in ZTP siteconfig:
http://registry.kni-qe-0.lab.eng.rdu2.redhat.com:3000/kni-qe/ztp-site-configs/src/ff814164cdcd355ed980f1edf269dbc2afbe09aa/siteconfig/master-2.yaml#L40
OCP 4.14.0-rc.0
advanced-cluster-management.v2.9.0-130
multicluster-engine.v2.4.0-154
After encountering https://issues.redhat.com/browse/OCPBUGS-18959
Attempted to forcefully delete the BMH by removing the finalizer.
Then deleted all the metal3 pods.
Attempted to re-create the bmh.
Result:
The BMH is stuck in the registering state:
$ oc get bmh
NAME                                           STATE         CONSUMER   ONLINE   ERROR   AGE
hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com   registering              true             15m
The BMO log shows these entries:
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"start","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"hardwareData is ready to be deleted","baremetalhost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"}}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"controllers.BareMetalHost","msg":"host ready to be powered off","baremetalhost":
,"provisioningState":"powering off before delete"}
{"level":"info","ts":"2023-09-13T16:15:57Z","logger":"provisioner.ironic","msg":"ensuring host is powered off (mode: hard)","host":"kni-qe-65~hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com"}
{"level":"error","ts":"2023-09-13T16:15:57Z","msg":"Reconciler error","controller":"baremetalhost","controllerGroup":"metal3.io","controllerKind":"BareMetalHost","BareMetalHost":{"name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","namespace":"kni-qe-65"},"namespace":"kni-qe-65","name":"hp-e910-01.kni-qe-65.lab.eng.rdu2.redhat.com","reconcileID":"167061cc-7ab4-4c4a-ae45-8c19dfc3ac22","error":"action \"powering off before delete\" failed: failed to power off before deleting node: Host not registered","errorVerbose":"Host not registered\nfailed to power off before deleting node\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).actionPowerOffBeforeDeleting\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:493\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).handlePoweringOffBeforeDelete\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:585\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*hostStateMachine).ReconcileState\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/host_state_machine.go:202\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:225\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598\naction \"powering off before delete\" failed\ngithub.com/metal3-io/baremetal-operator/controllers/metal3%2eio.(*BareMetalHostReconciler).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/controllers/metal3.io/baremetalhost_controller.go:229\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:118\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:314\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226\nruntime.goexit\n\t/usr/lib/golang/src/runtime/asm_amd64.s:1598","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:324\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:265\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/metal3-io/baremetal-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:226"}
Description of problem:
When the user configures the install-config.yaml additionalTrustBundle field (for example, in a disconnected installation using a local registry), the user-ca-bundle configmap gets populated with more content than strictly required
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Set up a local registry and mirror the content of an OCP release.
2. Configure the install-config.yaml for a mirrored installation. In particular, configure the additionalTrustBundle field with the registry cert.
3. Create the agent ISO, boot the nodes, and wait for the installation to complete.
Actual results:
The user-ca-bundle configmap does not contain only the registry cert
Expected results:
user-ca-bundle configmap with just the content of the install-config additionalTrustBundle field
Additional info:
Environment: OCP 4.12.24
Installation Method: IPI: Manual Mode + STS using a customer-provided AWS IAM Role
I am trying to deploy an OCP4 cluster on AWS for my customer. The customer does not permit creation of IAM users so I am performing a Manual Mode with STS IPI installation instead. I have been given an IAM role to assume for the OCP installation, but unfortunately the customer's AWS Organizational Service Control Policy (SCP) does not permit the use of the iam:GetUser{} permission.
(I have informed my customer that iam:GetUser is an installation requirement - it's clearly documented in our docs, and I have raised a ticket with their internal support team requesting that their SCP is amended to include iam:getUser, however I have been informed that my request is likely to be rejected).
With this limitation understood, I still attempted to install OCP4. Surprisingly, I was able to deploy an OCP (4.12) cluster without any apparent issues, however when I tried to destroy the cluster I encountered the following error from the installer (note: fields in brackets <> have been redacted):
DEBUG search for IAM roles
DEBUG iterating over a page of 74 IAM roles
DEBUG search for IAM users
DEBUG iterating over a page of 1 IAM users
INFO get tags for <ARN of the IAM user>: AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy
INFO status code: 403, request id: <request ID>
DEBUG search for IAM instance profiles
INFO error while finding resources to delete error=get tags for <ARN of IAM user> AccessDenied: User:<ARN of my user> is not authorized to perform: iam:GetUser on resource: <IAM username> with an explicit deny in a service control policy status code: 403, request id: <request ID>
Similarly, the error in AWS CloudTrail logs shows the following (note: some fields in brackets have been redacted):
User: arn:aws:sts::<AWS account no>:assumed-role/<role-name>/<user name> is not authorized to perform: iam:GetUser on resource <IAM User> with an explicit deny in a service control policy
It appears that the destroy operation is failing when the installer is trying to list tags on the only IAM user in the customer's AWS account. As discussed, the SCP does not permit the use of iam:GetUser and consequently this API call on the IAM user is denied. The installer then enters an endless loop as it continuously retries the operation. We have potentially identified the iamUserSearch function within the installer code at pkg/destroy/aws/iamhelpers.go as the area where this call is failing.
There does not appear to be a handler for "AccessDenied" API error in this function. Therefore we request that the access denied event is gracefully handled and skipped over when processing IAM users, allowing the installer to continue with the destroy operation, much in the same way that a similar access denied event is handled within the iamRoleSearch function when processing IAM roles:
We therefore request that the following is considered and addressed:
1. Re-assess whether the iam:GetUser permission is actually needed for cluster installation and cluster operations.
2. If the permission is required, the installer should provide a warning or halt the installation.
3. During a "destroy" cluster operation, the installer should gracefully handle AccessDenied errors from the API, skip over any IAM users whose tags it does not have permission to list, and then continue with the destroy operation.
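The third request above (skip IAM users whose tags cannot be read, then carry on) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the installer's Go code: `FakeIAM`, `AccessDeniedError`, `find_owned_users`, the user names, and the tag key are all hypothetical names chosen for the example, and a real implementation would inspect the AWS error code instead of a custom exception type.

```python
from dataclasses import dataclass


class AccessDeniedError(Exception):
    """Hypothetical stand-in for an AWS API error whose code is AccessDenied."""


@dataclass
class FakeIAM:
    """Hypothetical in-memory IAM client, used only to make the sketch runnable."""
    tags_by_user: dict
    denied_users: frozenset = frozenset()

    def list_user_tags(self, user_name):
        if user_name in self.denied_users:
            raise AccessDeniedError(f"not authorized to perform iam:GetUser on {user_name}")
        return self.tags_by_user.get(user_name, {})


def find_owned_users(client, user_names, owned_tag):
    """Return (owned, skipped): users carrying the cluster tag, and users we could not inspect."""
    key, value = owned_tag
    owned, skipped = [], []
    for name in user_names:
        try:
            tags = client.list_user_tags(name)
        except AccessDeniedError:
            # Skip this user instead of failing the whole search,
            # so the caller does not retry the same denied call forever.
            skipped.append(name)
            continue
        if tags.get(key) == value:
            owned.append(name)
    return owned, skipped


client = FakeIAM(
    tags_by_user={"cluster-user": {"kubernetes.io/cluster/demo": "owned"}},
    denied_users=frozenset({"org-admin"}),
)
owned, skipped = find_owned_users(
    client, ["org-admin", "cluster-user"], ("kubernetes.io/cluster/demo", "owned")
)
print(owned, skipped)
```

Returning the skipped users separately lets the caller log them once as a warning while the rest of the destroy operation proceeds.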
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver/pull/99
Description of problem:
Observation from the CIS v1.4 PDF:
1.1.9 Ensure that the Container Network Interface file permissions are set to 600 or more restrictive
"Container Network Interface provides various networking options for overlay networking. You should consult their documentation and restrict their respective file permissions to maintain the integrity of those files. Those files should be writable by only the administrators on the system."
To conform with CIS benchmarks, the /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600.
$ for i in $(oc get pods -n openshift-sdn -l app=sdn -oname); do oc exec -n openshift-sdn $i -- find /var/lib/cni/networks/openshift-sdn -type f -exec stat -c %a {} \;; done
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644 644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644 644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644
Defaulted container "sdn" out of: sdn, kube-rbac-proxy
644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644 644
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
The file permissions for /var/lib/cni/networks/openshift-sdn files in all sdn pods are 644
Expected results:
The file permissions for /var/lib/cni/networks/openshift-sdn files in all sdn pods should be updated to 600
Additional info:
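For reference, "600 or more restrictive" can be checked mechanically: a mode passes when it sets no bits beyond owner read and owner write. The sketch below is only an illustration of that check; the helper names `is_600_or_stricter` and `parse_stat_output` are hypothetical, and the sample input is made up to resemble the `stat -c %a` output shown in this report.

```python
import stat


def is_600_or_stricter(mode):
    """True when no permission bits beyond owner read/write are set."""
    # 0o177 covers owner-execute plus every group and other bit.
    return (stat.S_IMODE(mode) & 0o177) == 0


def parse_stat_output(text):
    """Parse octal modes as printed by `stat -c %a`, separated by whitespace."""
    return [int(token, 8) for token in text.split()]


# Illustrative input only; not taken from a real cluster.
modes = parse_stat_output("644 644 600 400")
violations = [oct(m) for m in modes if not is_600_or_stricter(m)]
print(violations)
```

With this definition 600 and 400 pass, while 644 fails because the group and other read bits are set.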
OCP 4.11 ships the alertrelabelconfigs CRD as a techpreview feature. Before graduating to GA we need to have e2e tests in the CMO repository.
AC:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/242
Please review the following PR: https://github.com/openshift/oc/pull/1545
Description of problem:
The OpenShift DNS daemonset uses the rolling update strategy. The "maxSurge" parameter is set to a non-zero value, which means that the "maxUnavailable" parameter is set to zero. When the user replaces the toleration in the daemonset's template spec (via the OpenShift DNS config API), swapping the one that allows scheduling on the master nodes for any other toleration, the new pods still try to be scheduled on the master nodes. The old pods on the tolerated nodes can be lucky enough to be recreated, but only if they are processed before any pod from a node that is no longer tolerated. The new pods are not expected to be scheduled on the nodes that are not tolerated by the new daemonset's template spec. The daemonset controller should just delete the old pods from the nodes that cannot be tolerated anymore. The old pods on the nodes that can still be tolerated should be recreated according to the rolling update parameters.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create the daemonset which tolerates "node-role.kubernetes.io/master" taint and has the following rolling update parameters:
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.updateStrategy
rollingUpdate:
  maxSurge: 10%
  maxUnavailable: 0
type: RollingUpdate
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: node-role.kubernetes.io/master
  operator: Exists
2. Let the daemonset be scheduled on all the target nodes (e.g. all masters and all workers)
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   119m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-9cjdf   2/2   Running   0   2m35s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   119m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   2m12s   10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   119m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   112m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
3. Update the daemonset's tolerations by removing "node-role.kubernetes.io/master" and adding any other toleration (a non-existent one works too):
$ oc -n openshift-dns get ds dns-default -o yaml | yq .spec.template.spec.tolerations
- key: test-taint
  operator: Exists
Actual results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-6bfmf   2/2   Running   0   124m    10.129.0.40   ci-ln-sb5ply2-72292-qlhc8-master-2         <none>   <none>
dns-default-76vjz   0/2   Pending   0   3m2s    <none>        <none>                                     <none>   <none>
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-c6j9x   2/2   Running   0   124m    10.128.0.13   ci-ln-sb5ply2-72292-qlhc8-master-0         <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-lx2nf   2/2   Running   0   124m    10.130.0.15   ci-ln-sb5ply2-72292-qlhc8-master-1         <none>   <none>
dns-default-mmc78   2/2   Running   0   117m    10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Expected results:
$ oc -n openshift-dns get pods -o wide | grep dns-default
dns-default-9cjdf   2/2   Running   0   7m24s   10.129.2.15   ci-ln-sb5ply2-72292-qlhc8-worker-c-m5wzq   <none>   <none>
dns-default-fhqrs   2/2   Running   0   7m1s    10.131.0.29   ci-ln-sb5ply2-72292-qlhc8-worker-a-6q7hs   <none>   <none>
dns-default-mmc78   2/2   Running   0   7m54s   10.128.2.7    ci-ln-sb5ply2-72292-qlhc8-worker-b-bpjdk   <none>   <none>
Additional info:
Upstream issue: https://github.com/kubernetes/kubernetes/issues/118823
Slack discussion: https://redhat-internal.slack.com/archives/CKJR6200N/p1687455135950439
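The expected behaviour described in this report depends on deciding, per node, whether the new tolerations still cover the node's taints. The sketch below shows a simplified version of the Kubernetes toleration matching rules in Python. It is an illustration only, not the daemonset controller's code; the function names and the shortened node names (`master-0`, `worker-a`) are hypothetical.

```python
def tolerates(toleration, taint):
    """Simplified Kubernetes-style match of one toleration against one taint."""
    # An empty toleration effect matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def should_run_on(node_taints, tolerations):
    """A daemon pod fits a node when every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


# Hypothetical, shortened node names standing in for the cluster in this report.
nodes = {
    "master-0": [{"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"}],
    "worker-a": [],
}
new_tolerations = [{"key": "test-taint", "operator": "Exists"}]

keep = sorted(n for n, taints in nodes.items() if should_run_on(taints, new_tolerations))
delete_old_pods_on = sorted(n for n in nodes if n not in keep)
print(keep, delete_old_pods_on)
```

Under these rules the master node drops out once the toleration is replaced, which matches the expected results listed above: old pods there should be deleted, not surged.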
Description of problem:
IPI installation on Alibabacloud fails, with zero control-plane nodes ready.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. IPI installation on Alibabacloud, with "credentialsMode: Manual"
Actual results:
Bootstrap failed, with all control-plane nodes NotReady.
Expected results:
The installation should succeed.
Additional info:
The log bundle is available at https://drive.google.com/file/d/1eb1D6GeNyu1Bys6vDyf3ev9aFjzWW6lW/view?usp=drive_link. The installation of exactly the same scenario can succeed with 4.14.0-ec.4-x86_64.
Please review the following PR: https://github.com/openshift/csi-node-driver-registrar/pull/49
Description of problem:
hypershift dump fails to acquire localhost-kubeconfig when impersonating. When attempting to dump guest cluster, it fails to read Secrets from the HCP namespace on the management cluster. As a result, it can't access anything from the guest cluster and fails to dump it successfully.
Version-Release number of selected component (if applicable):
Hypershift 0.1.11 Supported OCP version 4.15.0
How reproducible:
100%
Steps to Reproduce:
Execute hypershift dump cluster --as backplane-cluster-admin --name ${CLUSTER_NAME} --namespace ocm-${ENVIRONMENT}-${CLUSTER_ID} --dump-guest-cluster --artifact-dir ${DIR_NAME}
Actual results:
After a while, a failure message appears showing a permission issue when attempting to acquire localhost-kubeconfig.
Expected results:
localhost-kubeconfig should be acquired correctly, and the dump of the guest cluster should succeed.
Additional info:
Description of problem:
hypershift_nodepools_available_replicas does not properly reflect the nodepool.
$ oc get nodepools -n ocm-production-12345678
NAME              CLUSTER   DESIRED NODES   CURRENT NODES   AUTOSCALING   AUTOREPAIR   VERSION   UPDATINGVERSION   UPDATINGCONFIG   MESSAGE
re-test-workers   re-test   2               0               False         True         4.12.35                                      Minimum availability requires 2 replicas, current 0 available
Meanwhile, there are 3 hypershift_nodepools_available_replicas time series for the nodepools:
- re-test-worker2 reporting 1
- re-test-worker3 reporting 1
- re-test-workers reporting 0 (accurate)
The issue here is the two extra time series, which should not exist if the nodepool doesn't exist.
Version-Release number of selected component (if applicable):
4.12.35
How reproducible:
This particular cluster had its OIDC configuration along with other customer AWS account resources deleted, which might be connected to the misbehaviour of the metric.
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Adding must-gather and metric time series in the ticket
This is a clone of issue OCPBUGS-25810. The following is the description of the original issue:
—
No QA required, updating approvers across releases
Description of problem:
Version-Release number of selected component (if applicable):
OCP 4.13.0-0.nightly-2023-03-23-204038 ODF 4.13.0-121.stable
How reproducible:
Steps to Reproduce:
1. Installed ODF over OCP; everything was fine on the Installed Operators page.
2. Later, when the Installed Operators page was checked, it crashed with an "Oh no! Something went wrong" error.
Actual results:
Installed Operators page crashes with "Oh no! Something went wrong." error
Expected results:
Installed Operators page shouldn't crash.
Component and stack trace logs from the console page: http://pastebin.test.redhat.com/1096522
Additional info:
Please review the following PR: https://github.com/openshift/ironic-static-ip-manager/pull/40
Please review the following PR: https://github.com/openshift/azure-file-csi-driver-operator/pull/74
Description of the problem:
Minor issue:
While testing the API functions that add a manifest to a cluster, I noticed that invalid file names normally return
status=422,
reason="Unprocessable Entity",
However, for the file name "sp ce.yaml "
we get 400 instead of 422, with a generic Bad Request entity:
Reason: Bad Request
HTTP response headers: HTTPHeaderDict(
)
HTTP response body:
I believe it is better to align this with the exception we get in the other cases, for example when creating a file with an invalid file extension or with a file name that already exists (422).
How reproducible:
Steps to reproduce:
1. Use the v2_create_cluster_manifest API to create a manifest with the name "sp ce.yaml"
2.
3.
Actual results:
400, Bad Request
Expected results:
422, reason="Unprocessable Entity"
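One way to get the consistent behaviour requested here is to route every file-name check through a single validator that always returns the same status. The Python sketch below is a hypothetical illustration, not the service's actual code: `validate_manifest_name`, the allowed extension set, and the 201 success status are assumptions made for the example.

```python
import os

# Assumption: the accepted extensions; the real service may allow a different set.
ALLOWED_EXTENSIONS = {".yaml", ".yml", ".json"}

UNPROCESSABLE = (422, "Unprocessable Entity")


def validate_manifest_name(file_name, existing_names):
    """Return (status, reason); every invalid-name case maps to the same 422."""
    if any(ch.isspace() for ch in file_name):
        # Names containing whitespace are rejected like any other invalid name.
        return UNPROCESSABLE
    if os.path.splitext(file_name)[1] not in ALLOWED_EXTENSIONS:
        return UNPROCESSABLE
    if file_name in existing_names:
        return UNPROCESSABLE
    # Assumption: 201 Created stands in for whatever the API returns on success.
    return (201, "Created")


print(validate_manifest_name("sp ce.yaml ", set()))
print(validate_manifest_name("good.yaml", {"other.yaml"}))
```

Keeping the checks in one function makes it harder for a single case, such as a name containing a space, to fall through to a generic 400.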
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1550
Description of problem:
cherry-pick of https://github.com/openshift/cluster-image-registry-operator/pull/955
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When the IPI installer creates a service instance for the user, PowerVS now reports the type as composite_instance rather than service_instance. Fix up cluster deletion to account for this change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster
2. Destroy cluster
Actual results:
The newly created service instance does not delete.
Expected results:
Additional info:
The cluster-ingress-operator repository vendors controller-runtime v0.15.0, which uses Kubernetes 1.27 packages. OpenShift 4.15 is based on Kubernetes 1.28.
4.15.
Always.
Check https://github.com/openshift/cluster-ingress-operator/blob/release-4.15/go.mod.
The sigs.k8s.io/controller-runtime package is at v0.15.0.
The sigs.k8s.io/controller-runtime package is at v0.16.0 or newer.
https://github.com/openshift/cluster-ingress-operator/pull/990 already bumped the k8s.io/* packages to v0.28.2, but ideally the controller-runtime package should be bumped too. The controller-runtime v0.16 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.16.0.
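The go.mod check described above can be automated. The following Python sketch parses a go.mod excerpt and compares the controller-runtime version against v0.16.0. The `module_version` helper is hypothetical, and the embedded go.mod text is an illustrative excerpt built from the versions quoted in this report, not a copy of the real file.

```python
import re


def module_version(go_mod_text, module):
    """Return the (major, minor, patch) required for `module`, or None if absent."""
    pattern = re.compile(
        r"^\s*(?:require\s+)?" + re.escape(module) + r"\s+v(\d+)\.(\d+)\.(\d+)",
        re.MULTILINE,
    )
    match = pattern.search(go_mod_text)
    return tuple(int(part) for part in match.groups()) if match else None


# Illustrative excerpt only; versions taken from the description above.
GO_MOD = """\
module github.com/openshift/cluster-ingress-operator

require (
    k8s.io/api v0.28.2
    sigs.k8s.io/controller-runtime v0.15.0
)
"""

found = module_version(GO_MOD, "sigs.k8s.io/controller-runtime")
needs_bump = found is not None and found < (0, 16, 0)
print(found, needs_bump)
```

Comparing version tuples avoids the string-comparison trap where "v0.9.0" would sort after "v0.16.0".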
Description of problem:
To ensure that pods run in separate zones for a hypershift cluster, a PodAntiAffinity spec should be provided.
Version-Release number of selected component (if applicable):
4.12, 4.13, 4.14
How reproducible:
Always
Steps to Reproduce:
1. Create a hypershift control plane in HA mode.
2. Observe the multus admission controller pods.
Actual results:
Not all pods are scheduled in separate zones.
Expected results:
Pods are scheduled in separate zones.
Additional info:
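As an illustration of the kind of spec this report asks for, the sketch below builds a pod anti-affinity term keyed on the zone topology label. It is a hedged example, not the actual fix: the `zone_anti_affinity` helper and the `app: multus-admission-controller` label are assumptions, and whether the real change uses a required or a preferred rule is not stated in this report.

```python
def zone_anti_affinity(match_labels):
    """Build an affinity stanza that keeps matching pods in different zones."""
    return {
        "podAntiAffinity": {
            # Assumption: a hard requirement; a preferred rule is the softer alternative.
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(match_labels)},
                    "topologyKey": "topology.kubernetes.io/zone",
                }
            ]
        }
    }


# Hypothetical label; the real pods may be selected differently.
affinity = zone_anti_affinity({"app": "multus-admission-controller"})
term = affinity["podAntiAffinity"]["requiredDuringSchedulingIgnoredDuringExecution"][0]
print(term["topologyKey"])
```

The `topologyKey` is what makes the rule zone-aware: pods matching the selector may not share a node group with the same value for that label.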
Work with Kyl on getting RHTAP setup with release-4.15
When platform-specific passwords are included in the install-config.yaml, they are stored in the generated agent-cluster-install.yaml, which is included in the output of the agent-gather command. These passwords should be redacted.
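Redaction of this kind can be sketched as a recursive walk over the parsed document that masks any string stored under a key containing "password". The Python below is a minimal illustration under stated assumptions: `redact_passwords`, the `<redacted>` marker, and the sample field layout are hypothetical and do not reflect the actual agent-gather implementation.

```python
import copy

REDACTED = "<redacted>"


def redact_passwords(node):
    """Return a copy of `node` with string values under password-like keys masked."""
    if isinstance(node, dict):
        return {
            key: (
                REDACTED
                if "password" in str(key).lower() and isinstance(value, str)
                else redact_passwords(value)
            )
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [redact_passwords(item) for item in node]
    return copy.deepcopy(node)


# Hypothetical layout, for illustration only.
config = {
    "platform": {"vsphere": {"username": "admin", "password": "s3cret"}},
    "hosts": [{"bmc": {"password": "pw"}}],
}
redacted = redact_passwords(config)
print(redacted)
```

Working on a copy leaves the in-memory configuration intact, so only the gathered output is masked.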
Please review the following PR: https://github.com/openshift/images/pull/153
Description of problem:
From our initial investigation, it seems that the network-node-identity component does not need management cluster access in Hypershift.
We were looking at: https://github.com/openshift/cluster-network-operator/blob/release-4.14/bindata/network/node-identity/managed/node-identity.yaml
For the webhook and approver container: https://github.com/openshift/ovn-kubernetes/blob/release-4.14/go-controller/cmd/ovnkube-identity/ovnkubeidentity.go
For the token minter container: https://github.com/openshift/hypershift/blob/release-4.14/token-minter/tokenminter.go
We also tested by disabling automountServiceAccountToken, and things still seemed to function.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Deploy a 4.14 hosted cluster 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-27509. The following is the description of the original issue:
—
When a MachineAutoscaler references a currently-zero-Machine MachineSet that includes spec.template.spec.taints, the autoscaler fails to deserialize that MachineSet, which causes it to fail to autoscale that MachineSet. The autoscaler's deserialization logic should be improved to avoid failing on the presence of taints.
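To make the failure mode concrete: the taints arrive as generic decoded objects (maps of strings), so they have to be converted field by field into a typed taint, and a missing or null list should be treated as "no taints". The Python sketch below illustrates that idea only. It is not the autoscaler's Go code, and `Taint`, `taint_from_unstructured`, and `taints_from_machineset` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


def taint_from_unstructured(raw):
    """Build a Taint from a decoded JSON object; return None when it is unusable."""
    if not isinstance(raw, dict):
        return None
    key, effect = raw.get("key"), raw.get("effect")
    if not isinstance(key, str) or not isinstance(effect, str):
        return None
    value = raw.get("value", "")
    return Taint(key=key, effect=effect, value=value if isinstance(value, str) else "")


def taints_from_machineset(machineset):
    """Collect taints from spec.template.spec.taints, accepting a missing or null field."""
    spec = machineset.get("spec", {}).get("template", {}).get("spec", {})
    raw_taints = spec.get("taints") or []
    parsed = (taint_from_unstructured(item) for item in raw_taints)
    return [taint for taint in parsed if taint is not None]


machineset = {
    "spec": {"template": {"spec": {"taints": [
        {"effect": "NoSchedule", "key": "node-role.kubernetes.io/ci", "value": "ci"}
    ]}}}
}
print(taints_from_machineset(machineset))
```

The key point is that the conversion reads the map's fields instead of assuming the decoded value already has the typed shape.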
Reproduced on 4.14.10 and 4.16.0-ec.1. Expected to be every release going back to at least 4.12, based on code inspection.
Always.
With a launch 4.14.10 gcp Cluster Bot cluster (logs):
$ oc adm upgrade
Cluster version is 4.14.10
Upstream: https://api.integration.openshift.com/api/upgrades_info/graph
Channel: candidate-4.14 (available channels: candidate-4.14, candidate-4.15)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc -n openshift-machine-api get machinesets.machine.openshift.io
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-a   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-b   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-c   1         1         1       1           29m
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             29m
Pick that set with 0 nodes. They don't come with taints by default:
$ oc -n openshift-machine-api get -o json machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f | jq '.spec.template.spec.taints'
null
So patch one in:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "add", "path": "/spec/template/spec/taints", "value": [{"effect":"NoSchedule","key":"node-role.kubernetes.io/ci","value":"ci"} ]}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
And set up autoscaling:
$ cat cluster-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  maxNodeProvisionTime: 30m
  scaleDown:
    enabled: true
$ oc apply -f cluster-autoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
I'm not all that familiar with autoscaling. Maybe the ClusterAutoscaler doesn't matter, and you need a MachineAutoscaler aimed at the chosen MachineSet?
$ cat machine-autoscaler.yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: test
  namespace: openshift-machine-api
spec:
  maxReplicas: 2
  minReplicas: 1
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: ci-ln-s48f02k-72292-5z2hn-worker-f
$ oc apply -f machine-autoscaler.yaml
machineautoscaler.autoscaling.openshift.io/test created
Checking the autoscaler's logs:
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail -1 | grep taint
W0122 19:18:47.246369 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:18:58.474000 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:09.703748 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
W0122 19:19:20.929617 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
...
And the MachineSet is failing to scale:
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             50m
While if I remove the taint:
$ oc -n openshift-machine-api patch machineset.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f --type json -p '[{"op": "remove", "path": "/spec/template/spec/taints"}]'
machineset.machine.openshift.io/ci-ln-s48f02k-72292-5z2hn-worker-f patched
The autoscaler... well, it's not scaling up new Machines like I'd expected, but at least it seems to have calmed down about the taint deserialization issue:
$ oc -n openshift-machine-api get machines.machine.openshift.io
NAME                                       PHASE     TYPE                REGION        ZONE            AGE
ci-ln-s48f02k-72292-5z2hn-master-0         Running   e2-custom-6-16384   us-central1   us-central1-a   53m
ci-ln-s48f02k-72292-5z2hn-master-1         Running   e2-custom-6-16384   us-central1   us-central1-b   53m
ci-ln-s48f02k-72292-5z2hn-master-2         Running   e2-custom-6-16384   us-central1   us-central1-c   53m
ci-ln-s48f02k-72292-5z2hn-worker-a-fwskf   Running   e2-standard-4       us-central1   us-central1-a   45m
ci-ln-s48f02k-72292-5z2hn-worker-b-qkwlt   Running   e2-standard-4       us-central1   us-central1-b   45m
ci-ln-s48f02k-72292-5z2hn-worker-c-rlw4m   Running   e2-standard-4       us-central1   us-central1-c   45m
$ oc -n openshift-machine-api get machinesets.machine.openshift.io ci-ln-s48f02k-72292-5z2hn-worker-f
NAME                                 DESIRED   CURRENT   READY   AVAILABLE   AGE
ci-ln-s48f02k-72292-5z2hn-worker-f   0         0                             53m
$ oc -n openshift-machine-api logs -l k8s-app=cluster-autoscaler --tail 50
I0122 19:23:17.284762 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:17.687036 1 legacy.go:296] No candidates for scale down
W0122 19:23:27.924167 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:28.510701 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:28.909507 1 legacy.go:296] No candidates for scale down
W0122 19:23:39.148266 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:39.737359 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:40.135580 1 legacy.go:296] No candidates for scale down
W0122 19:23:50.376616 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:23:50.963064 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:23:51.364313 1 legacy.go:296] No candidates for scale down
W0122 19:24:01.601764 1 clusterapi_unstructured.go:217] Unable to convert data to taint: %vmap[effect:NoSchedule key:node-role.kubernetes.io/ci value:ci]
I0122 19:24:02.191330 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:02.589766 1 legacy.go:296] No candidates for scale down
I0122 19:24:13.415183 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:13.815851 1 legacy.go:296] No candidates for scale down
I0122 19:24:24.641190 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:25.040894 1 legacy.go:296] No candidates for scale down
I0122 19:24:35.867194 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:36.266400 1 legacy.go:296] No candidates for scale down
I0122 19:24:47.097656 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:47.498099 1 legacy.go:296] No candidates for scale down
I0122 19:24:58.326025 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:24:58.726034 1 legacy.go:296] No candidates for scale down
I0122 19:25:04.927980 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:25:04.938213 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 10.036399ms
I0122 19:25:09.552086 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:09.952094 1 legacy.go:296] No candidates for scale down
I0122 19:25:20.778317 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:21.178062 1 legacy.go:296] No candidates for scale down
I0122 19:25:32.005246 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:32.404966 1 legacy.go:296] No candidates for scale down
I0122 19:25:43.233637 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:43.633889 1 legacy.go:296] No candidates for scale down
I0122 19:25:54.462009 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:25:54.861513 1 legacy.go:296] No candidates for scale down
I0122 19:26:05.688410 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:06.088972 1 legacy.go:296] No candidates for scale down
I0122 19:26:16.915156 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:17.315987 1 legacy.go:296] No candidates for scale down
I0122 19:26:28.143877 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:28.543998 1 legacy.go:296] No candidates for scale down
I0122 19:26:39.369085 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:39.770386 1 legacy.go:296] No candidates for scale down
I0122 19:26:50.596923 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:26:50.997262 1 legacy.go:296] No candidates for scale down
I0122 19:27:01.823577 1 static_autoscaler.go:552] No unschedulable pods
I0122 19:27:02.223290 1 legacy.go:296] No candidates for scale down
I0122 19:27:04.938943 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0122 19:27:04.947353 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 8.319938ms
Scale-from-zero MachineAutoscaler fails on taint-deserialization when the referenced MachineSet contains spec.template.spec.taints.
Scale-from-zero MachineAutoscaler works, even when the referenced MachineSet contains spec.template.spec.taints.
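The warning in the logs above shows the autoscaler receiving the taint as an untyped map (effect, key, value) and failing to turn it into a typed taint. The following is a minimal sketch of that conversion step, written in Python rather than the autoscaler's Go, and not the actual fix. The names Taint and parse_taints are hypothetical; the set of accepted effects follows the standard Kubernetes taint effects.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any, Iterable, List, Tuple

# Standard Kubernetes taint effects.
VALID_EFFECTS = {"NoSchedule", "PreferNoSchedule", "NoExecute"}


@dataclass(frozen=True)
class Taint:
    """Hypothetical typed taint, mirroring the key/value/effect triple."""
    key: str
    value: str
    effect: str


def parse_taints(raw: Iterable[Any]) -> Tuple[List[Taint], List[str]]:
    """Convert untyped taint entries (as read from spec.template.spec.taints).

    Returns (taints, warnings). Malformed entries are skipped and reported
    instead of aborting the whole conversion.
    """
    taints: List[Taint] = []
    warnings: List[str] = []
    for entry in raw:
        if not isinstance(entry, Mapping):
            warnings.append(f"skipping non-mapping taint entry: {entry!r}")
            continue
        key = entry.get("key")
        effect = entry.get("effect")
        if not isinstance(key, str) or not key:
            warnings.append(f"skipping taint without a key: {dict(entry)!r}")
            continue
        if not isinstance(effect, str) or effect not in VALID_EFFECTS:
            warnings.append(f"skipping taint with unknown effect: {dict(entry)!r}")
            continue
        # The value is optional on a taint; default to the empty string.
        taints.append(Taint(key=key, value=str(entry.get("value", "")), effect=effect))
    return taints, warnings
```

Called with the map from the log lines, parse_taints([{"effect": "NoSchedule", "key": "node-role.kubernetes.io/ci", "value": "ci"}]) yields one typed taint and no warnings.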
{
2023-07-19T16:52:37Z reason/ReusedPodIP podIP 10.128.0.39 is currently assigned to multiple pods: ns/e2e-replicaset-4951 pod/test-rs-ddhkn node/ip-10-0-151-233.us-west-1.compute.internal uid/117115dd-dc8f-4333-b972-ed880fcf8dd9;ns/openshift-apiserver pod/apiserver-5f7d4599b4-dvpdk node/ip-10-0-151-233.us-west-1.compute.internal uid/293cba9c-11ea-4258-9d38-4ff5b2cb52bd
2023-07-19T16:58:40Z reason/ReusedPodIP podIP 10.128.0.39 is currently assigned to multiple pods: ns/e2e-job-1076 pod/pod-disruption-failure-ignore-2-qlxp2 node/ip-10-0-151-233.us-west-1.compute.internal uid/3dda8eea-b221-433a-b254-fc7cf487189b;ns/openshift-apiserver pod/apiserver-5f7d4599b4-dvpdk node/ip-10-0-151-233.us-west-1.compute.internal uid/293cba9c-11ea-4258-9d38-4ff5b2cb52bd
}
I0719 16:44:56.659916 49761 base_network_controller_pods.go:444] [default/openshift-apiserver/apiserver-5f7d4599b4-dvpdk] creating logical port openshift-apiserver_apiserver-5f7d4599b4-dvpdk for pod on switch ip-10-0-151-233.us-west-1.compute.internal
W0719 16:44:56.666407 49761 base_network_controller_pods.go:198] No cached port info for deleting pod default/openshift-kube-controller-manager/installer-7-ip-10-0-151-233.us-west-1.compute.internal. Using logical switch ip-10-0-151-233.us-west-1.compute.internal port uuid and addrs [10.128.0.39/23]
I0719 16:44:56.680604 49761 base_network_controller_pods.go:234] Releasing IPs for Completed pod: openshift-kube-controller-manager/installer-7-ip-10-0-151-233.us-west-1.compute.internal, ips: 10.128.0.39
I0719 16:44:56.699279 49761 pods.go:134] Attempting to release IPs for pod: openshift-kube-controller-manager/installer-7-ip-10-0-151-233.us-west-1.compute.internal, ips: 10.128.0.39
I0719 16:44:56.790903 49761 client.go:783] "msg"="transacting operations" "database"="OVN_Northbound" "operations"="[\{Op:insert Table:Logical_Switch_Port Row:map[addresses:{GoSet:[0a:58:0a:80:00:27 10.128.0.39]} external_ids:\{GoMap:map[namespace:openshift-apiserver pod:true]} name:openshift-apiserver_apiserver-5f7d4599b4-dvpdk
Observed in
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-net[…]perator-master-e2e-aws-ovn-single-node/1681699276796727296
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_clus[…]netes_ovnkube-node-bsbt9_ovnkube-controller.log
Please review the following PR: https://github.com/openshift/cluster-samples-operator/pull/527
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
On the OAuth page (/k8s/cluster/config.openshift.io~v1~OAuth/cluster), the table listing the identity providers (IDPs) is no longer shown.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-22-204142
How reproducible:
Always
Steps to Reproduce:
1. Open the OAuth page (/k8s/cluster/config.openshift.io~v1~OAuth/cluster).
Actual results:
1. The IDP list table is missing.
Expected results:
1. The IDP list table should be shown.
Additional info:
screenshot: https://drive.google.com/file/d/1xmF5_RYZtAfcfY57kWi9ttcahKFFd_Kc/view?usp=sharing
Description of problem:
A user noticed that deleting a cluster did not clean up the IPI-generated service instance. Add more debugging statements to find out why.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create cluster 2. Delete cluster
Actual results:
Expected results:
Additional info:
Description of problem:
Debugging would be easier if the message for these alerts included the namespace: https://github.com/openshift/cluster-ingress-operator/blob/master/manifests/0000_90_ingress-operator_03_prometheusrules.yaml#L69
Version-Release number of selected component (if applicable):
4.12.x
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
No namespace in the alert message
Expected results:
Additional info:
Description of problem:
oc-mirror panics when it fails to bind to a port.
Version-Release number of selected component (if applicable):
./oc-mirror version
Logging to .oc-mirror.log
WARNING: This version information is deprecated and will be replaced with the output from --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"", Minor:"", GitVersion:"4.15.0-202311101707.p0.g1c8f538.assembly.stream-1c8f538", GitCommit:"1c8f538897c88011c51ab53ea5073547521f0676", GitTreeState:"clean", BuildDate:"2023-11-10T18:49:00Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
How reproducible:
always
Steps to Reproduce:
Run the command while port 5000 is already in use: oc-mirror --from file://out docker://localhost:5000/ocptest --v2 --config config.yaml --dest-tls-verify=false
Actual results:
oc-mirror --from file://out docker://localhost:5000/ocptest --v2 --config config.yaml --dest-tls-verify=false
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
2023/11/15 13:04:47 [INFO] : mode diskToMirror
2023/11/15 13:04:47 [INFO] : local storage registry will log to /app1/1106/logs/registry.log
2023/11/15 13:04:47 [INFO] : starting local storage on :5000
panic: listen tcp :5000: bind: address already in use
goroutine 67 [running]:
github.com/openshift/oc-mirror/v2/pkg/cli.panicOnRegistryError(0x0?)
/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/pkg/cli/executor.go:298 +0x4e
created by github.com/openshift/oc-mirror/v2/pkg/cli.(*ExecutorSchema).PrepareStorageAndLogs
/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/pkg/cli/executor.go:286 +0x945
Expected results:
oc-mirror should exit with an error instead of panicking.
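The expected behavior (report the bind failure and exit cleanly) can be illustrated with a minimal sketch. It is written in Python rather than oc-mirror's Go and is not the project's code; start_listener is a hypothetical helper that returns an error message to the caller instead of raising or panicking.

```python
import socket
from typing import Optional, Tuple


def start_listener(host: str, port: int) -> Tuple[Optional[socket.socket], Optional[str]]:
    """Try to listen on host:port.

    Returns (socket, None) on success or (None, error message) on failure,
    so the caller can print the error and exit with a non-zero status.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError as err:
        # Typical cause: "address already in use" when the port is taken.
        sock.close()
        return None, f"cannot start local storage on {host}:{port}: {err}"
    return sock, None
```

A caller would check the second element and exit with an error status when it is set, which matches the "exit with error but not panic" expectation above.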
This is a clone of issue OCPBUGS-27908. The following is the description of the original issue:
—
Description of problem:
Navigation: Workloads -> Deployments -> (select any Deployment from list) -> Details -> Volumes -> Remove volume
Issue: Message "Are you sure you want to remove volume audit-policies from Deployment: apiserver?" is in English.
Observation: Translation is present in branch release-4.15 file... frontend/public/locales/ja/public.json
Version-Release number of selected component (if applicable):
4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Content is in English
Expected results:
Content should be in selected language
Additional info:
Reference screenshot attached.
Description of problem:
While running the 4.15 installer full-function test, the following three instance families were detected and verified. They need to be added to the installer doc [1]:
- standardHBv4Family
- standardMSMediumMemoryv3Family
- standardMDSMediumMemoryv3Family
[1] https://github.com/openshift/installer/blob/master/docs/user/azure/tested_instance_types_x86_64.md
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
During cluster installations and upgrades with an ImageContentSourcePolicy (ICSP) in place but with access to quay.io, the ICSP is not honored for the machine-os-content image: it is pulled from quay.io instead of the private registry.
Version-Release number of selected component (if applicable):
$ oc logs -n openshift-machine-config-operator ds/machine-config-daemon -c machine-config-daemon|head -1
Found 6 pods, using pod/machine-config-daemon-znknf
I0503 10:53:00.925942 2377 start.go:112] Version: v4.12.0-202304070941.p0.g87fedee.assembly.stream-dirty (87fedee690ae487f8ae044ac416000172c9576a5)
How reproducible:
100% in clusters with ICSP configured BUT with access to quay.io
Steps to Reproduce:
1. Create mirror repo:
$ cat <<EOF > /tmp/isc.yaml
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
archiveSize: 4
storageConfig:
  registry:
    imageURL: quay.example.com/mirror/oc-mirror-metadata
    skipTLS: true
mirror:
  platform:
    channels:
    - name: stable-4.12
      type: ocp
      minVersion: 4.12.13
    graph: true
EOF
$ oc mirror --dest-skip-tls --config=/tmp/isc.yaml docker://quay.example.com/mirror/oc-mirror-metadata
<...>
info: Mirroring completed in 2m27.91s (138.6MB/s)
Writing image mapping to oc-mirror-workspace/results-1683104229/mapping.txt
Writing UpdateService manifests to oc-mirror-workspace/results-1683104229
Writing ICSP manifests to oc-mirror-workspace/results-1683104229
2. Confirm machine-os-content digest:
$ oc adm release info 4.12.13 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a1660c8086ff85e569e10b3bc9db344e1e1f7530581d742ad98b670a81477b1b"
}
$ oc adm release info 4.12.14 -o jsonpath='{.references.spec.tags[?(@.name=="machine-os-content")].from}'|jq
{
  "kind": "DockerImage",
  "name": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed68d04d720a83366626a11297a4f3c5761c0b44d02ef66fe4cbcc70a6854563"
}
3. Create 4.12.13 cluster with ICSP at install time:
$ grep imageContentSources -A6 ./install-config.yaml
imageContentSources:
- mirrors:
  - quay.example.com/mirror/oc-mirror-metadata/openshift/release
  source: quay.io/openshift-release-dev/ocp-v4.0-art-dev
- mirrors:
  - quay.example.com/mirror/oc-mirror-metadata/openshift/release-images
  source: quay.io/openshift-release-dev/ocp-release
Actual results:
1. After the installation is completed, no pulls for a166 (4.12.13-x86_64-machine-os-content) are logged in the Quay usage logs, whereas e.g. digest 22d2 (4.12.13-x86_64-machine-os-images) is reported as pulled from the mirror.
2. After upgrading to 4.12.14, no pulls for ed68 (4.12.14-x86_64-machine-os-content) are logged in the mirror-registry, while the image was pulled as part of `oc image extract` in the machine-config-daemon:
[core@master-1 ~]$ sudo less /var/log/pods/openshift-machine-config-operator_machine-config-daemon-7fnjz_e2a3de54-1355-44f9-a516-2f89d6c6ab8f/machine-config-daemon/0.log
2023-05-03T10:51:43.308996195+00:00 stderr F I0503 10:51:43.308932 11290 run.go:19] Running: nice -- ionice -c 3 oc image extract -v 10 --path /:/run/mco-extensions/os-extensions-content-4035545447 --registry-config /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ad48fe01f3e82584197797ce2151eecdfdcce67ae1096f06412e5ace416f66ce
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418008 184455 client_mirrored.go:174] Attempting to connect to quay.io/openshift-release-dev/ocp-v4.0-art-dev
2023-05-03T10:51:43.418211869+00:00 stderr F I0503 10:51:43.418174 184455 round_trippers.go:466] curl -v -XGET -H "User-Agent: oc/4.12.0 (linux/amd64) kubernetes/31aa3e8" 'https://quay.io/v2/'
2023-05-03T10:51:43.419618513+00:00 stderr F I0503 10:51:43.419517 184455 round_trippers.go:495] HTTP Trace: DNS Lookup for quay.io resolved to [{34.206.15.82 } {54.209.210.231 } {52.5.187.29 } {52.3.168.193 } {52.21.36.23 } {50.17.122.58 } {44.194.68.221 } {34.194.241.136 } {2600:1f18:483:cf01:ebba:a861:1150:e245 } {2600:1f18:483:cf02:40f9:477f:ea6b:8a2b } {2600:1f18:483:cf02:8601:2257:9919:cd9e } {2600:1f18:483:cf01:8212:fcdc:2a2a:50a7 } {2600:1f18:483:cf00:915d:9d2f:fc1f:40a7 } {2600:1f18:483:cf02:7a8b:1901:f1cf:3ab3 } {2600:1f18:483:cf00:27e2:dfeb:a6c7:c4db } {2600:1f18:483:cf01:ca3f:d96e:196c:7867 }]
2023-05-03T10:51:43.429298245+00:00 stderr F I0503 10:51:43.429151 184455 round_trippers.go:510] HTTP Trace: Dial to tcp:34.206.15.82:443 succeed
Expected results:
All images are pulled from the location as configured in the ICSP.
Additional info:
Description of problem:
When deploying a dual-stack HostedCluster, the user can define networks like this:
networking:
  clusterNetwork:
  - cidr: fd01::/48
    hostPrefix: 64
  - cidr: 10.132.0.0/14
    hostPrefix: 23
  networkType: OVNKubernetes
  serviceNetwork:
  - cidr: fd02::/112
  - cidr: 172.31.0.0/16
This leads to a misconfiguration on the hosted cluster where services have their ClusterIP set to the IPv6 family (the pod network still defaults to IPv4 no matter what the order was).
When deploying a dual-stack cluster with the openshift-install binary, a validation prevents users from configuring default IPv6 networks:
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: [networking.serviceNetwork: Invalid value: "fd02::/112, 172.30.0.0/16": IPv4 addresses must be listed before IPv6 addresses, networking.clusterNetwork: Invalid value: "fd01::/48, 10.132.0.0/14": IPv4 addresses must be listed before IPv6 addresses]
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: networking.clusterNetwork: Invalid value: "fd01::/48, 10.132.0.0/14": IPv4 addresses must be listed before IPv6 addresses
HyperShift should detect this and either block the cluster creation or swap the order so the cluster gets created with default IPv4 networks.
Version-Release number of selected component (if applicable):
latest
How reproducible:
Always
Steps to Reproduce:
1. Deploy a HC with the networking settings specified and using the image with dual stack patches included quay.io/jparrill/hypershift:OCPBUGS-15331-mix-413v12
Actual results:
Cluster gets deployed with default IPv6 family for services network.
Expected results:
Cluster creation gets blocked OR cluster gets deployed with default IPv4 family for services network.
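Both expected outcomes (block the creation, or swap the order) can be sketched with Python's standard ipaddress module. This is an illustration under assumptions, not HyperShift's or the installer's implementation: the helper names are hypothetical, and the check only looks at whether the first entry of a dual-stack list is IPv4. The error text copies the installer message quoted above.

```python
import ipaddress
from typing import List


def ipv4_first_errors(field: str, cidrs: List[str]) -> List[str]:
    """Return validation errors when a dual-stack list does not start with IPv4."""
    versions = [ipaddress.ip_network(c).version for c in cidrs]
    is_dual_stack = 4 in versions and 6 in versions
    if is_dual_stack and versions[0] != 4:
        joined = ", ".join(cidrs)
        return [
            f'{field}: Invalid value: "{joined}": '
            "IPv4 addresses must be listed before IPv6 addresses"
        ]
    return []


def ipv4_first(cidrs: List[str]) -> List[str]:
    """Alternative to blocking: reorder so IPv4 entries come first.

    sorted() is stable, so the relative order within each family is kept.
    """
    return sorted(cidrs, key=lambda c: ipaddress.ip_network(c).version)
```

With the serviceNetwork from the description, ipv4_first_errors("networking.serviceNetwork", ["fd02::/112", "172.31.0.0/16"]) returns one error, and ipv4_first on the same list puts 172.31.0.0/16 first.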
Additional info:
Description of the problem:
I am able to create a custom manifest with the file name .yaml
I believe the API should block this
How reproducible:
Using test-infra, I created a manifest with the file name .yaml
Steps to reproduce:
1. Using v2_create_cluster_manifest, I am able to create a manifest with the ".yaml " file name.
Actual results:
The manifest is created, no error is thrown, and I am able to list the manifest and see that it is applied to the cluster.
Expected results:
The API should throw a 422 exception.
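The kind of check that would reject this file name can be sketched as follows. This is a hypothetical illustration, not the assisted-service validation: the helper names, the allowed-extension set, and the 201 success code are assumptions; only the 422 status comes from the expected result above.

```python
import posixpath
from typing import Optional

# Assumption for illustration: extensions a manifest may use.
ALLOWED_EXTENSIONS = {"yaml", "yml", "json"}


def manifest_name_error(file_name: str) -> Optional[str]:
    """Return a reason the manifest file name is invalid, or None if it is fine."""
    base = posixpath.basename(file_name.strip())
    stem, dot, ext = base.rpartition(".")
    if not dot or ext.lower() not in ALLOWED_EXTENSIONS:
        return f"unsupported or missing extension in {file_name!r}"
    if not stem:
        # Covers ".yaml": an extension with no base name in front of it.
        return f"file name {file_name!r} has no base name"
    return None


def status_for(file_name: str) -> int:
    """Map the validation result to an HTTP status (201 is an assumption)."""
    return 422 if manifest_name_error(file_name) else 201
```

Under these assumptions, status_for(".yaml") and status_for(".yaml ") both give 422, while a name such as "50-worker.yaml" is accepted.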
The cluster-version operator should not crash while trying to evaluate a bogus condition.
4.10 and later are exposed to the bug. It's possible that the OCPBUGS-19512 series increases exposure.
Unclear.
1. Create a cluster.
2. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge.json (you may need to adjust version strings and digests for your test-cluster's release).
3. Wait around 30 minutes.
4. Point it at https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json (again, may need some customization).
$ grep -B1 -A15 'too fresh' previous.log
I0927 12:07:55.594222 1 cincinnati.go:114] Using a root CA pool with 0 root CA subjects to request updates from https://raw.githubusercontent.com/shellyyang1989/upgrade-cincy/master/cincy-conditional-edge-invalid-promql.json?arch=amd64&channel=stable-4.15&id=dc628f75-7778-457a-bb69-6a31a243c3a9&version=4.15.0-0.test-2023-09-27-091926-ci-ln-01zw7kk-latest
I0927 12:07:55.726463 1 cache.go:118] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is the most stale cached cluster-condition match entry, but it is too fresh (last evaluated on 2023-09-27 11:37:25.876804482 +0000 UTC m=+175.082381015). However, we don't have a cached evaluation for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, so attempt to evaluate that now.
I0927 12:07:55.726602 1 cache.go:129] {"type":"PromQL","promql":{"promql":"0 * group(cluster_version)"}} is stealing this cluster-condition match call for {"type":"PromQL","promql":{"promql":"group(cluster_version_available_updates{channel=buggy})"}}, because its last evaluation completed 30m29.849594461s ago
I0927 12:07:55.758573 1 cvo.go:703] Finished syncing available updates "openshift-cluster-version/version" (170.074319ms)
E0927 12:07:55.758847 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 194 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1c4df00?, 0x32abc60})
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc001489d40?})
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:49 +0x75
panic({0x1c4df00, 0x32abc60})
/usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql.(*PromQL).Match(0xc0004860e0, {0x220ded8, 0xc00041e550}, 0x0)
/go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/promql/promql.go:134 +0x419
github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache.(*Cache).Match(0xc0002d3ae0, {0x220ded8, 0xc00041e550}, 0xc0033948d0)
/go/src/github.com/openshift/cluster-version-operator/pkg/clusterconditions/cache/cache.go:132 +0x982
github.com/openshift/cluster-version-operator/pkg/clusterconditions.(*conditionRegistry).Match(0xc000016760, {0x220ded8, 0xc00041e550}, {0xc0033948a0, 0x1, 0x0?})
No panics.
I'm still not entirely clear on how OCPBUGS-19512 would have increased exposure.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/116
This is a clone of issue OCPBUGS-27264. The following is the description of the original issue:
—
Description of problem:
The e2e-aws-ovn-shared-to-local-gateway-mode-migration and e2e-aws-ovn-local-to-shared-gateway-mode-migration jobs fail about 50% of the time with:
+ oc patch Network.operator.openshift.io cluster --type=merge --patch '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"gatewayConfig":{"routingViaHost":false}}}}}'
network.operator.openshift.io/cluster patched
+ oc wait co network --for=condition=PROGRESSING=True --timeout=60s
error: timed out waiting for the condition on clusteroperators/network
Description of problem:
When a workload includes a node selector term on the label kubernetes.io/arch and the allowed values do not include amd64, the auto scaler does not trigger the scale out of a valid, non-amd64, machine set if its current replicas are 0 and (for 4.14+) no architecture capacity annotation is set (ref MIXEDARCH-129).
The issue is due to https://github.com/openshift/kubernetes-autoscaler/blob/f0ceeacfca57014d07f53211a034641d52d85cfd/cluster-autoscaler/cloudprovider/utils.go#L33
This bug should first be addressed on clusters that have the same architecture for the control plane and the data plane.
In multi-arch compute clusters, there is probably no alternative to having the capacity annotation properly set on the machine set, either manually or by the cloud provider actuator, as already discussed in the MIXEDARCH-129 work, and otherwise relying on the control plane architecture.
Version-Release number of selected component (if applicable):
- ARM64 IPI on GCP 4.14
- ARM64 IPI on AWS and Azure <=4.13
- In general, non-amd64 single-arch clusters supporting autoscale from 0
How reproducible:
Always
Steps to Reproduce:
1. Create an arm64 IPI cluster on GCP
2. Set one of the machinesets to have 0 replicas: oc scale -n openshift-machine-api machineset/adistefa-a1-zn8pg-worker-f --replicas=0
3. Deploy the default autoscaler
4. Deploy the machine autoscaler for the given machineset
5. Deploy a workload with node affinity to arm64-only nodes, large resource requests, and a sufficient number of replicas.
Actual results:
From the pod events: pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity/selector
Expected results:
The cluster autoscaler scales the machineset with 0 replicas in order to provide resources for the pending pods.
Additional info:
---
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec: {}
---
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a
  namespace: openshift-machine-api
spec:
  minReplicas: 0
  maxReplicas: 12
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: adistefa-a1-zn8pg-worker-f
---
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: openshift-machine-api
  name: 'my-deployment'
  annotations: {}
spec:
  selector:
    matchLabels:
      app: name
  replicas: 3
  template:
    metadata:
      labels:
        app: name
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values:
                - "arm64"
      containers:
        - name: container
          image: >-
            image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest
          ports:
            - containerPort: 8080
              protocol: TCP
          env: []
          resources:
            requests:
              cpu: "2"
      imagePullSecrets: []
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  paused: false
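The fallback described in this issue (use the capacity annotation when present, otherwise rely on the control plane architecture rather than a fixed amd64 default) can be sketched as below. This is a Python illustration, not the autoscaler's Go code. The function names are hypothetical, and the annotation key with its comma-separated label=value format is an assumption to verify against the autoscaler documentation.

```python
from typing import Iterable, Mapping

ARCH_LABEL = "kubernetes.io/arch"
# Assumed annotation carrying comma-separated label=value pairs for the
# scale-from-zero node template.
LABELS_ANNOTATION = "capacity.cluster-autoscaler.kubernetes.io/labels"


def scale_from_zero_arch(annotations: Mapping[str, str], control_plane_arch: str) -> str:
    """Pick the architecture for a MachineSet that currently has 0 replicas."""
    raw = annotations.get(LABELS_ANNOTATION, "")
    for pair in raw.split(","):
        name, sep, value = pair.strip().partition("=")
        if sep and name == ARCH_LABEL and value:
            return value
    # No annotation: fall back to the control plane architecture instead of
    # assuming amd64, which is what single-arch non-amd64 clusters need.
    return control_plane_arch


def template_matches(allowed_arches: Iterable[str], annotations: Mapping[str, str],
                     control_plane_arch: str) -> bool:
    """Would a pod restricted to allowed_arches fit the scale-from-zero template?"""
    return scale_from_zero_arch(annotations, control_plane_arch) in set(allowed_arches)
```

For the arm64 GCP cluster in the reproduction steps, template_matches(["arm64"], {}, "arm64") is true under this sketch, so the pending pods could trigger a scale-up of the empty machineset.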
Description of problem:
The GCP Mint mode sync is failing when attempting to add permissions to a previously deleted custom role.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a GCP cluster in mint mode (with a CCO credentialRequests that has permissions defined)
2. Delete the openshift-hive-dev-cloud-credential-operator-gcp-ro-creds custom role from GCP
3. oc -n openshift-cloud-credential-operator delete secret cloud-credential-operator-gcp-ro-creds
Actual results:
Receive the following error when attempting to add permissions to the deleted custom role: "cloud-credential-operator cannot add new grants to deleted gcp role"
Expected results:
The new permissions should be added to the role without issue.
Additional info:
Description of the problem:
The MCE operator is always installed at version 2.3. The version should be chosen dynamically based on the OCP version:
ocp_mce_version_matrix:
'4.14': '2.4'
'4.13': '2.3'
'4.12': '2.2'
'4.11': '2.1'
'4.10': '2.0'
How reproducible:
100%
Steps to reproduce:
1. Create a 4.12 cluster
2. Select MCE operator to be installed on cluster
3. Install cluster
4. Verify OCP and MCE versions
Actual results:
OCP 4.12.26, MCE 2.3.0
It looks like the service installs only 2.3 and does not consider the OCP version:
https://github.com/openshift/assisted-service/blob/master/internal/operators/mce/config.go
const ( MceMinOpenshiftVersion string = "4.10.0" MceChannel string = "stable-2.3"
Expected results:
MCE 2.2
The MCE installation version should be dynamic and depend on the OCP version:
ocp_mce_version_matrix:
'4.14': '2.4'
'4.13': '2.3'
'4.12': '2.2'
'4.11': '2.1'
'4.10': '2.0'
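The matrix above maps directly onto a lookup keyed by the OCP major.minor version. A minimal sketch follows, in Python rather than the service's Go; the function name mce_channel is hypothetical, and the "stable-" channel prefix follows the MceChannel value quoted earlier.

```python
# OCP minor version -> MCE version, copied from the matrix above.
OCP_MCE_VERSION_MATRIX = {
    "4.14": "2.4",
    "4.13": "2.3",
    "4.12": "2.2",
    "4.11": "2.1",
    "4.10": "2.0",
}


def mce_channel(ocp_version: str) -> str:
    """Return the MCE channel for a full OCP version such as "4.12.26"."""
    parts = ocp_version.split(".")
    if len(parts) < 2:
        raise ValueError(f"cannot parse OCP version {ocp_version!r}")
    key = f"{parts[0]}.{parts[1]}"
    if key not in OCP_MCE_VERSION_MATRIX:
        raise ValueError(f"no MCE version known for OCP {key}")
    return f"stable-{OCP_MCE_VERSION_MATRIX[key]}"
```

For the cluster in this report, mce_channel("4.12.26") gives "stable-2.2", which is the expected MCE 2.2 rather than 2.3.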
Please review the following PR: https://github.com/openshift/cluster-api-provider-ibmcloud/pull/58
Please review the following PR: https://github.com/openshift/containernetworking-plugins/pull/122
Description of problem:
While upgrading a loaded 250-node ROSA cluster from 4.13.13 to 4.14.rc2, the upgrade failed and got stuck while the network operator was trying to upgrade.
Around 20 multus pods were in CrashLoopBackOff state with the following log:
oc logs multus-4px8t
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully copied files in /usr/src/multus-cni/rhel9/bin/ to /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315
2023-10-10T00:54:34+00:00 [cnibincopy] Successfully moved files in /host/opt/cni/bin/upgrade_6dcb644a-4164-42a5-8f1e-4ae2c04dc315 to /host/opt/cni/bin/
2023-10-10T00:54:34Z [verbose] multus-daemon started
2023-10-10T00:54:34Z [verbose] Readiness Indicator file check
2023-10-10T00:55:19Z [error] have you checked that your default network is ready? still waiting for readinessindicatorfile @ /host/run/multus/cni/net.d/10-ovn-kubernetes.conf. pollimmediate error: timed out waiting for the condition
Tracker issue for bootimage bump in 4.15. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-12868.
Please review the following PR: https://github.com/openshift/cloud-credential-operator/pull/600
This is a clone of issue OCPBUGS-26197. The following is the description of the original issue:
—
Description of problem:
The PKI operator runs even when the annotation to turn off PKI is set on the hosted control plane.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of the problem:
Agents don't run the StepTypeTangConnectivityCheck step on day-2 hosts in imported clusters
How reproducible:
Unknown
Steps to reproduce:
1. Install day-1 cluster with Tang
2. Attempt to add day-2 host
Actual results:
disk-encryption-requirements-satisfied stuck pending
Expected results:
disk-encryption-requirements-satisfied should be eventually either failed or success
Platform:
IPI on Baremetal
What happened?
In cases where no hostname is provided, hosts are automatically assigned the name "localhost" or "localhost.localdomain".
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME STATUS ROLES AGE VERSION
localhost.localdomain Ready master 31m v1.22.1+6859754
master-0-1 Ready master 39m v1.22.1+6859754
master-0-2 Ready master 39m v1.22.1+6859754
worker-0-0 Ready worker 12m v1.22.1+6859754
worker-0-1 Ready worker 12m v1.22.1+6859754
What did you expect to happen?
Having all hosts come up as localhost is the worst possible user experience, because they'll fail to form a cluster but you won't know why.
However, the image-customization-controller knows the BMH name, so it would be possible to configure the ignition to set a default hostname if we don't get one from DHCP/DNS.
If not, we should at least fail the installation with an error message specific to this situation.
----------
30/01/22 - adding how to reproduce
----------
How to Reproduce:
1) Prepare an installation with day-1 static IP.
Add to install-config under one of the nodes:
networkConfig:
routes:
config:
2) Ensure a DNS PTR record for the address IS NOT configured.
3) Create manifests and cluster from install-config.yaml.
The installation should either:
1) fail as early as possible and provide feedback that no hostname was provided, or
2) derive the hostname from the BMH or the ignition files.
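The two acceptable outcomes above (derive a default hostname, or fail early with a specific error) can be sketched as follows. This is a minimal illustration only, not the image-customization-controller's code; the function name `effective_hostname` and the set of placeholder names are assumptions made for the example.

```python
# Hypothetical sketch: pick a node hostname, falling back to the BMH name
# when DHCP/DNS supplied nothing usable.
PLACEHOLDER_HOSTNAMES = {"", "localhost", "localhost.localdomain"}


def effective_hostname(discovered, bmh_name):
    """Return the discovered hostname, or the BMH name if it is a placeholder.

    Raises ValueError when neither source yields a usable name, so the
    problem is reported early instead of producing a 'localhost' node.
    """
    candidate = (discovered or "").strip().lower()
    if candidate not in PLACEHOLDER_HOSTNAMES:
        return candidate
    if bmh_name:
        return bmh_name
    raise ValueError("no hostname from DHCP/DNS and no BMH name available")
```

With this shape, a host that reports "localhost.localdomain" ends up named after its BareMetalHost, and a host with no name at all produces an explicit error rather than a silent cluster-formation failure.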
Please review the following PR: https://github.com/openshift/oc/pull/1544
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
prepare Hypershift for the CAPI bump to v1.5.2 https://github.com/openshift/cluster-api/pull/181 so that hypershift-e2e can pass.
This is a clone of issue OCPBUGS-19830. The following is the description of the original issue:
—
Description of problem:
There are several testcases in conformance testsuite that are failing due to openshift-multus configuration.
We are running conformance testsuite as part of our Openshift on Openstack CI. We use that just to confirm correct functionality of the cluster. The command we are using to run the test suite is:
openshift-tests run --provider '{\"type\":\"openstack\"}' openshift/conformance/parallel
The names of the tests that failed are:
1. [sig-arch] Managed cluster should ensure platform components have system-* priority class associated [Suite:openshift/conformance/parallel]
Reason is:
6 pods found with invalid priority class (should be openshift-user-critical or begin with system-):
openshift-multus/whereabouts-reconciler-6q6h7 (currently "")
openshift-multus/whereabouts-reconciler-87dwn (currently "")
openshift-multus/whereabouts-reconciler-fvhwv (currently "")
openshift-multus/whereabouts-reconciler-h68h5 (currently "")
openshift-multus/whereabouts-reconciler-nlz59 (currently "")
openshift-multus/whereabouts-reconciler-xsch6 (currently "")
2. [sig-arch] Managed cluster should only include cluster daemonsets that have maxUnavailable or maxSurge update of 10 percent or maxUnavailable of 33 percent [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/daemon_set.go:105]: Sep 23 16:12:15.283: Daemonsets found that do not meet platform requirements for update strategy: expected daemonset openshift-multus/whereabouts-reconciler to have maxUnavailable 10% or 33% (see comment) instead of 1, or maxSurge 10% instead of 0 Ginkgo exit error 1: exit with code 1
3. [sig-arch] Managed cluster should set requests but not limits [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/operators/resources.go:196]: Sep 23 16:12:17.489: Pods in platform namespaces are not following resource request/limit rules or do not have an exception granted: apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on cpu of 50m which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[cpu]") apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts defines a limit on memory of 100Mi which is not allowed (rule: "apps/v1/DaemonSet/openshift-multus/whereabouts-reconciler/container/whereabouts/limit[memory]") Ginkgo exit error 1: exit with code 1
4. [sig-node][apigroup:config.openshift.io] CPU Partitioning cluster platform workloads should be annotated correctly for DaemonSets [Suite:openshift/conformance/parallel]
Reason is:
fail [github.com/openshift/origin/test/extended/cpu_partitioning/pods.go:159]: Expected <[]error | len:1, cap:1>: [ <*errors.errorString | 0xc0010fa380>{ s: "daemonset (whereabouts-reconciler) in openshift namespace (openshift-multus) must have pod templates annotated with map[target.workload.openshift.io/management:{\"effect\": \"PreferredDuringScheduling\"}]", }, ] to be empty
How reproducible: Always
Steps to Reproduce: Run conformance testsuite:
https://github.com/openshift/origin/blob/master/test/extended/README.md
Actual results: Testcases failing
Expected results: Testcases passing
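The update-strategy rule quoted in the second failure can be read as a simple predicate. The sketch below is a simplified interpretation of the error message only (maxUnavailable of 10% or 33%, or maxSurge of 10%); it is not the origin test code, and the function name `meets_update_strategy` is hypothetical.

```python
def meets_update_strategy(max_unavailable, max_surge):
    """Check a DaemonSet rolling-update setting against the quoted rule.

    Values are passed as the strings found in the manifest,
    for example "1", "0", "10%" or "33%".
    """
    return max_unavailable in ("10%", "33%") or max_surge == "10%"


# The whereabouts-reconciler values reported in the failure message.
print(meets_update_strategy("1", "0"))
```

The reported values (maxUnavailable 1, maxSurge 0) do not satisfy the predicate, which matches the test failure.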
Please review the following PR: https://github.com/openshift/coredns/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
While developing a demo for a Java application that first builds with the source-to-image strategy and then copies artefacts from the resulting image (s2i builder plus compiled sources) into a slimmer runtime image using an inline Dockerfile build strategy on OpenShift, the deployment fails because the inline Dockerfile build doesn't preserve the modification time of the files that get copied. This differs from how 'docker' itself handles a multi-stage build.
Version-Release number of selected component (if applicable):
4.12.14
How reproducible:
Always
Steps to Reproduce:
1. git clone https://github.com/jerboaa/quarkus-quickstarts
2. cd quarkus-quickstarts && git checkout ocp-bug-inline-docker
3. oc new-project quarkus-appcds-nok
4. oc process -f rest-json-quickstart/openshift/quarkus_runtime_appcds_template.yaml | oc create -f -
Actual results:
$ oc logs quarkus-rest-json-appcds-4-xc47z
INFO exec -a "java" java -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=/deployments/app-cds.jsa -Dquarkus.http.host=0.0.0.0 -cp "." -jar /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
INFO running in /deployments
Error occurred during initialization of VM
Unable to use shared archive.
An error has occurred while processing the shared archive file.
A jar file is not the one used while building the shared archive file: rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
Expected results:
Starting the Java application using /opt/jboss/container/java/run/run-java.sh ...
INFO exec -a "java" java -XX:MaxRAMPercentage=80.0 -XX:+UseParallelGC -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=20 -XX:GCTimeRatio=4 -XX:AdaptiveSizePolicyWeight=90 -XX:+ExitOnOutOfMemoryError -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=/deployments/app-cds.jsa -Dquarkus.http.host=0.0.0.0 -cp "." -jar /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
INFO running in /deployments
__  ____  __  _____   ___  __ ____  ______
 --/ __ \/ / / / _ | / _ \/ //_/ / / / __/
 -/ /_/ / /_/ / __ |/ , _/ ,< / /_/ /\ \
--\___\_\____/_/ |_/_/|_/_/|_|\____/___/
2023-10-27 18:13:01,866 INFO [io.quarkus] (main) rest-json-quickstart 1.0.0-SNAPSHOT on JVM (powered by Quarkus 3.4.3) started in 0.966s. Listening on: http://0.0.0.0:8080
2023-10-27 18:13:01,867 INFO [io.quarkus] (main) Profile prod activated.
2023-10-27 18:13:01,867 INFO [io.quarkus] (main) Installed features: [cdi, resteasy-reactive, resteasy-reactive-jackson, smallrye-context-propagation, vertx]
Additional info:
When deploying with AppCDS turned on, then we can get the pods to start and when we then look at the modified file time of the offending file we notice that these differ from the original s2i-merge-image (A) and the runtime image (B):
(A)
$ oc rsh quarkus-rest-json-appcds-s2i-1-x5hct stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
Size: 16057039 Blocks: 31368 IO Block: 4096 regular file
Device: 200001h/2097153d Inode: 60146490 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 185/ default) Gid: ( 0/ root)
Access: 2023-10-27 18:11:22.000000000 +0000
Modify: 2023-10-27 18:11:22.000000000 +0000
Change: 2023-10-27 18:11:41.555586774 +0000
Birth: 2023-10-27 18:11:41.491586774 +0000
(B)
$ oc rsh quarkus-rest-json-appcds-1-l7xw2 stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
Size: 16057039 Blocks: 31368 IO Block: 4096 regular file
Device: 2000a3h/2097315d Inode: 71601163 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2023-10-27 18:11:44.000000000 +0000
Modify: 2023-10-27 18:11:44.000000000 +0000
Change: 2023-10-27 18:12:12.169087346 +0000
Birth: 2023-10-27 18:12:12.114087346 +0000
Both should have 'Modify: 2023-10-27 18:11:22.000000000 +0000'.
When I perform a local s2i build of the same application sources and then use this multi-stage Dockerfile, the modify time of the files remains the same.
FROM quarkus-app-uberjar:ubi9 as s2iimg
FROM registry.access.redhat.com/ubi9/openjdk-17-runtime as final
COPY --from=s2iimg /deployments/* /deployments/
ENV JAVA_OPTS_APPEND="-XX:+UseCompressedClassPointers -XX:+UseCompressedOops -Xshare:on -XX:SharedArchiveFile=app-cds.jsa"
as shown here:
$ sudo docker run --rm -ti --entrypoint /bin/bash quarkus-app-uberjar:ubi9 -c 'stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar'
File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
Size: 16057020 Blocks: 31368 IO Block: 4096 regular file
Device: 6fh/111d Inode: 276781319 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 185/ default) Gid: ( 0/ root)
Access: 2023-10-27 15:52:28.000000000 +0000
Modify: 2023-10-27 15:52:28.000000000 +0000
Change: 2023-10-27 15:52:37.352926632 +0000
Birth: 2023-10-27 15:52:37.288926109 +0000
$ sudo docker run --rm -ti --entrypoint /bin/bash quarkus-cds-app -c 'stat /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar'
File: /deployments/rest-json-quickstart-1.0.0-SNAPSHOT-runner.jar
Size: 16057020 Blocks: 31368 IO Block: 4096 regular file
Device: 6fh/111d Inode: 14916403 Links: 1
Access: (0664/-rw-rw-r--) Uid: ( 185/ default) Gid: ( 0/ root)
Access: 2023-10-27 15:52:28.000000000 +0000
Modify: 2023-10-27 15:52:28.000000000 +0000
Change: 2023-10-27 15:53:04.408147760 +0000
Birth: 2023-10-27 15:53:04.346147253 +0000
Both have a modified file time of 2023-10-27 15:52:28.000000000 +0000
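The difference between the two behaviours can be reproduced outside of any container tooling. The sketch below is only an analogy in Python, not the OpenShift builder's copy logic: it copies a file once with metadata preserved and once without, and reports whether the modification time survived. The helper name `copy_keeps_mtime` and the file names are made up for the example.

```python
import os
import shutil
import tempfile


def copy_keeps_mtime(preserve):
    """Copy a file and report whether the copy kept the source's mtime."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "app.jar")
        dst = os.path.join(tmp, "copy.jar")
        with open(src, "wb") as f:
            f.write(b"payload")
        # Pin the source timestamps to a fixed point in the past.
        os.utime(src, (1_600_000_000, 1_600_000_000))
        if preserve:
            shutil.copy2(src, dst)     # copies data and metadata, including mtime
        else:
            shutil.copyfile(src, dst)  # copies data only; mtime becomes "now"
        return int(os.stat(src).st_mtime) == int(os.stat(dst).st_mtime)
```

A copy that behaves like the second branch produces a jar whose size matches but whose Modify time does not, which is the situation shown in outputs (A) and (B) above.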
Description of problem:
When any object is created from YAML with an empty editor window, the application crashes.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Navigate to Virtualization -> VirtualMachines
2. Open "Create VirtualMachine" menu
3. Select "With YAML"
4. Clear the editor content
5. Click "Create" button
Actual results:
The application crashes
Expected results:
User is notified about invalid/empty editor content.
Additional info:
The same happens in 4.13
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/95
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-30132. The following is the description of the original issue:
—
Description of problem:
In OCP 4.14 the catalog pods in openshift-marketplace were defined as:
$ oc get pods -n openshift-marketplace redhat-operators-4bnz4 -o yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  labels:
    olm.catalogSource: redhat-operators
    olm.pod-spec-hash: 658b699dc
  name: redhat-operators-4bnz4
  namespace: openshift-marketplace
  ...
spec:
  containers:
  - image: registry.redhat.io/redhat/redhat-operator-index:v4.14
    imagePullPolicy: Always
Now on OCP 4.15 they are defined as:
apiVersion: v1
kind: Pod
metadata:
  ...
  name: redhat-operators-44wxs
  namespace: openshift-marketplace
  ownerReferences:
  - apiVersion: operators.coreos.com/v1alpha1
    blockOwnerDeletion: false
    controller: true
    kind: CatalogSource
    name: redhat-operators
    uid: 3b41ac7b-7ad1-4d58-a62f-4a9e667ae356
  resourceVersion: "877589"
  uid: 65ad927c-3764-4412-8d34-82fd856a4cbc
spec:
  containers:
  - args:
    - serve
    - /extracted-catalog/catalog
    - --cache-dir=/extracted-catalog/cache
    command:
    - /bin/opm
    ...
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:7259b65d8ae04c89cf8c4211e4d9ddc054bb8aebc7f26fac6699b314dc40dbe3
    imagePullPolicy: Always
    ...
  initContainers:
  ...
  - args:
    - --catalog.from=/configs
    - --catalog.to=/extracted-catalog/catalog
    - --cache.from=/tmp/cache
    - --cache.to=/extracted-catalog/cache
    command:
    - /utilities/copy-content
    image: registry.redhat.io/redhat/redhat-operator-index:v4.15
    imagePullPolicy: IfNotPresent
    ...
Because of `imagePullPolicy: IfNotPresent` on the initContainer that extracts the content of the index image (which is referenced by tag), the catalogs are never really updated.
Version-Release number of selected component (if applicable):
OCP 4.15.0
How reproducible:
100%
Steps to Reproduce:
1. Wait for the next version of a released operator on OCP 4.15
Actual results:
Operator catalogs are never really refreshed due to imagePullPolicy: IfNotPresent for the index image
Expected results:
Operator catalogs are periodically (every 10 minutes by default) refreshed
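The reasoning behind this report can be captured in a small predicate: with `IfNotPresent`, the kubelet reuses an image that is already on the node, so a moving tag is never re-resolved, whereas a digest reference always names the same content. The sketch below is a simplified illustration; the function name `may_serve_stale_image` is hypothetical and the check ignores details such as node-local image garbage collection.

```python
def may_serve_stale_image(image, pull_policy):
    """Return True when a restarted container could keep using outdated content.

    A digest reference (containing "@sha256:") is immutable, so reusing the
    cached image is harmless. A tag can be re-pointed upstream, and
    IfNotPresent never asks the registry again once the image is cached.
    """
    referenced_by_digest = "@sha256:" in image
    return pull_policy == "IfNotPresent" and not referenced_by_digest


# The 4.15 initContainer from the description above.
print(may_serve_stale_image(
    "registry.redhat.io/redhat/redhat-operator-index:v4.15", "IfNotPresent"))
```

Applied to the pod specs in the description, only the 4.15 initContainer (tag plus `IfNotPresent`) is flagged; the 4.14 container (tag plus `Always`) and the digest-pinned opm container are not.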
Additional info:
OCPBUGS-5469 and backports began prioritizing later target releases, but we still wait 10m between different PromQL evaluations while evaluating conditional update risks. This ticket is tracking work to speed up cache warming, and allows changes that are too invasive to be worth backporting.
Definition of done:
Acceptance Criteria:
Description of problem:
There is an extra space in the Chinese translation of 'Duplicate RoleBinding' in the kebab list. The changes from PR https://github.com/openshift/console/pull/12099 were for some reason not included in the master/release4.12-4.14 branches.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-08-220853
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP and update the language to Chinese
2. Navigate to the RoleBindings page, choose one RoleBinding, click the kebab icon at the end, and check the translation text of 'Duplicate RoleBinding'
Actual results:
2. It's shown '重复 角色绑定' and "重复 集群角色绑定"
Expected results:
The extra space is removed, and the text shown is '重复角色绑定' and "重复集群角色绑定"
Additional info:
Description of problem:
According to https://docs.openshift.com/container-platform/4.11/release_notes/ocp-4-11-release-notes.html#ocp-4-11-deprecated-features-crio-parameters and Red Hat Insights, logSizeMax is deprecated in ContainerRuntimeConfig and should instead be set via containerLogMaxSize in KubeletConfig. When starting that transition, though, it was noticed that a ContainerRuntimeConfig such as the one shown below still gets logSizeMax and even overlaySize added to its spec.
$ bat /tmp/crio.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: pidlimit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ''
  containerRuntimeConfig:
    pidsLimit: 4096
    logLevel: debug
$ oc get containerruntimeconfig pidlimit -o json | jq '.spec.containerRuntimeConfig'
{
  "logLevel": "debug",
  "logSizeMax": "0",
  "overlaySize": "0",
  "pidsLimit": 4096
}
When checking on the OpenShift Container Platform 4 node using crio config, we can see that the values are not applied. Yet it is disturbing to see those options added to the specification when Red Hat in fact recommends moving them into KubeletConfig and removing them from ContainerRuntimeConfig. Further, having them still set in ContainerRuntimeConfig will trigger a false-positive alert in Red Hat Insights: the customer may have followed the recommendation, but the system does not reflect the changes made.
A similar problem was reported a while ago in https://bugzilla.redhat.com/show_bug.cgi?id=1941936 and fixed, so it is interesting that it has come back.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.4
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4.13.4
2. Create ContainerRuntimeConfig as shown above and validate the actual object created
3. Run oc get containerruntimeconfig pidlimit -o json | jq '.spec.containerRuntimeConfig' to validate the object created and inspect the spec.
Actual results:
$ oc get containerruntimeconfig pidlimit -o json | jq '.spec.containerRuntimeConfig'
{
  "logLevel": "debug",
  "logSizeMax": "0",
  "overlaySize": "0",
  "pidsLimit": 4096
}
Expected results:
$ oc get containerruntimeconfig pidlimit -o json | jq '.spec.containerRuntimeConfig'
{
  "logLevel": "debug",
  "pidsLimit": 4096
}
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/console-operator/pull/794
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/196
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
ipsec container kills pluto even if that was started by systemd
Version-Release number of selected component (if applicable):
on any 4.14 nightly
How reproducible:
every time
Steps to Reproduce:
1. enable N-S IPsec
2. enable E-W IPsec
3. kill/stop/delete one of the ipsec-host pods
Actual results:
pluto is killed on that host
Expected results:
pluto keeps running
Additional info:
The line referenced here should be removed: https://github.com/yuvalk/cluster-network-operator/blob/37d1cc72f4f6cd999046bd487a705e6da31301a5/bindata/network/ovn-kubernetes/common/ipsec-host.yaml#L235
Please review the following PR: https://github.com/openshift/cloud-provider-kubevirt/pull/28
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Assisted environment: SaaS (console.redhat.com)
Interface: REST API
OCP version:
Configuration:
3 masters, 3 workers
3 masters having a small extra disk (2GB) for etcd
3 workers having an extra disk 100GB
Validations fail when checking the small disk of the masters for ODF; increasing the disk for etcd solves the issue.
The validation code: https://github.com/openshift/assisted-service/blob/7e715004c9a4c77e056bd91fe698f7f68232418f/internal/operators/odf/validations.go#L162
The code should check only the workers when the cluster is not a compact cluster.
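The requested behaviour can be sketched as a host-selection step that runs before the disk checks. This is an illustration in Python under stated assumptions, not the assisted-service Go code: the function name `hosts_for_odf_validation`, the dictionary layout of a host, and the definition of a compact cluster as "no workers" are all assumptions made for the example.

```python
def hosts_for_odf_validation(hosts):
    """Select the hosts whose disks the ODF validation should inspect.

    Assumption: a cluster is 'compact' when it has no worker hosts, in which
    case the masters carry ODF. Otherwise only the workers are inspected,
    so a small etcd disk on a master cannot fail the check.
    """
    workers = [h for h in hosts if h["role"] == "worker"]
    return workers if workers else list(hosts)


# The configuration from the report: 3 masters (2 GB extra disk), 3 workers (100 GB).
cluster = (
    [{"name": f"master-{i}", "role": "master", "extra_disk_gb": 2} for i in range(3)]
    + [{"name": f"worker-{i}", "role": "worker", "extra_disk_gb": 100} for i in range(3)]
)
print([h["name"] for h in hosts_for_odf_validation(cluster)])
```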
Description of problem:
Currently the CAPI Cluster object always stays in the `Provisioning` state. This is because nothing sets the ControlPlaneEndpoint field on the object.
Version-Release number of selected component (if applicable):
all
How reproducible:
Always
Steps to Reproduce:
1. Run E2Es
2. See that Cluster always stays in Provisioning state
Actual results:
Cluster always stays in Provisioning state
Expected results:
Cluster should go into Provisioned state
Additional info:
As such, we need to update the E2E tests and the object creation scripts so that they set the ControlPlaneEndpoint before the Cluster object is created, which lets the Cluster go into the Provisioned state. This is a temporary workaround, as we expect the creation of the Cluster and InfrastructureCluster objects and the population of the ControlPlaneEndpoint to happen in a dedicated controller within the operator.
Please review the following PR: https://github.com/openshift/whereabouts-cni/pull/192
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-29773. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
When running must-gather against an SNO with the Telco DU profile, the perf-node-gather-daemonset is unable to start, with the error below:
Warning FailedCreate 2m37s (x16 over 5m21s) daemonset-controller Error creating: pods "perf-node-gather-daemonset-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the pod namespace "openshift-must-gather-sbhml" does not allow the workload type management
must-gather shows that it retries for 300s and then reports that performance data collection was complete even though the daemonset pod didn't come up.
[must-gather-nhbgr] POD 2023-09-26T10:15:39.591582116Z Waiting for performance profile collector pods to become ready: 1
[..]
[must-gather-nhbgr] POD 2023-09-26T10:21:07.108893075Z Waiting for performance profile collector pods to become ready: 300
[must-gather-nhbgr] POD 2023-09-26T10:21:08.473217146Z daemonset.apps "perf-node-gather-daemonset" deleted
[must-gather-nhbgr] POD 2023-09-26T10:21:08.480906220Z INFO: Node performance data collection complete.
Version-Release number of selected component (if applicable):
4.14.0-rc.2
How reproducible:
100%
Steps to Reproduce:
1. Deploy SNO with Telco DU profile
2. Run oc adm must-gather
Actual results:
performance data collection doesn't run because daemonset cannot be scheduled.
Expected results:
performance data collection runs.
Additional info:
DaemonSet describe:
oc -n openshift-must-gather-sbhml describe ds
Name: perf-node-gather-daemonset
Selector: name=perf-node-gather-daemonset
Node-Selector: <none>
Labels: <none>
Annotations: deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 1
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status: 0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels: name=perf-node-gather-daemonset
  Annotations: target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
  Containers:
    node-probe:
      Image: registry.kni-qe-0.lab.eng.rdu2.redhat.com:5000/openshift-release-dev@sha256:2af2c135f69f162ed8e0cede609ddbd207d71a3c7bd49e9af3fcbb16737aa25a
      Port: <none>
      Host Port: <none>
      Command:
        /bin/bash
        -c
        echo ok > /tmp/healthy && sleep INF
      Limits:
        cpu: 100m
        memory: 256Mi
      Requests:
        cpu: 100m
        memory: 256Mi
      Readiness: exec [cat /tmp/healthy] delay=5s timeout=1s period=5s #success=1 #failure=3
      Environment: <none>
      Mounts:
        /host/podresources from podres (rw)
        /host/proc from proc (ro)
        /host/sys from sys (ro)
        /lib/modules from lib-modules (ro)
  Volumes:
    sys:
      Type: HostPath (bare host directory volume)
      Path: /sys
      HostPathType: Directory
    proc:
      Type: HostPath (bare host directory volume)
      Path: /proc
      HostPathType: Directory
    lib-modules:
      Type: HostPath (bare host directory volume)
      Path: /lib/modules
      HostPathType: Directory
    podres:
      Type: HostPath (bare host directory volume)
      Path: /var/lib/kubelet/pod-resources
      HostPathType: Directory
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedCreate 2m37s (x16 over 5m21s) daemonset-controller Error creating: pods "perf-node-gather-daemonset-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the pod namespace "openshift-must-gather-sbhml" does not allow the workload type management
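The misleading "collection complete" message comes down to a wait loop that does not distinguish "pods became ready" from "gave up waiting". A minimal sketch of a loop that keeps the two outcomes apart is shown below. It is written in Python purely for illustration (the must-gather collection scripts are not Python), and the names `wait_for_collector_pods` and `pods_ready` are hypothetical.

```python
def wait_for_collector_pods(pods_ready, attempts=300):
    """Poll pods_ready() up to `attempts` times.

    Returns (True, message) as soon as the pods are ready, or
    (False, message) when the retry budget is exhausted, so the caller
    can report a failure instead of claiming the collection completed.
    A real loop would sleep between polls; that is omitted here.
    """
    for attempt in range(1, attempts + 1):
        if pods_ready():
            return True, f"collector pods ready after {attempt} attempt(s)"
    return False, f"collector pods not ready after {attempts} attempts; no performance data collected"


# A DaemonSet whose pods are rejected at admission never becomes ready.
ok, message = wait_for_collector_pods(lambda: False, attempts=3)
print(ok, message)
```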
Description of problem:
Install a private cluster whose base domain, set in install-config.yaml, is the same as the name of an existing CIS domain. After the private cluster is destroyed, its DNS resource records remain.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note that this domain name is also used by another existing CIS domain.
2. Install a private ibmcloud cluster with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com"
3. Destroy the cluster
4. Check the remaining DNS records
Actual results:
$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4
api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
*.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
Expected results:
No more dns records about the cluster
Additional info:
$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'
Name
private-ibmcloud.qe.devcluster.openshift.com
private-ibmcloud-1.qe.devcluster.openshift.com
ibmcloud.qe.devcluster.openshift.com
$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com
When private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com is used as the domain, the issue does not occur. When ibmcloud.qe.devcluster.openshift.com is used as the domain, the DNS records remain.
Description of problem:
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46064/consoleFull
https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-upgrade/job/upgrade-pipeline/46126/console
Version-Release number of selected component (if applicable):
How reproducible:
two upgrades, two failed.
Steps to Reproduce:
Triggered 2 upgrades for template 11_UPI on vSphere 8.0 & FIPS ON & OVN IPSEC & Static Network & Bonding & HW19 & Secureboot (IPSEC E-W only)
1. From 4.13.26-x86_64 -> 4.14.0-0.nightly-2023-12-08-072853 -> 4.15.0-0.nightly-2023-12-09-012410
12-11 16:28:56.968 oc get clusteroperators:
12-11 16:28:56.968 NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
12-11 16:28:56.968 authentication 4.15.0-0.nightly-2023-12-09-012410 False False True 104m APIServicesAvailable: "oauth.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
12-11 16:28:56.968 baremetal 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 cloud-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h43m
12-11 16:28:56.968 cloud-credential 4.15.0-0.nightly-2023-12-09-012410 True False False 5h45m
12-11 16:28:56.968 cluster-autoscaler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 config-operator 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 console 4.15.0-0.nightly-2023-12-09-012410 False False False 107m RouteHealthAvailable: console route is not admitted
12-11 16:28:56.968 control-plane-machine-set 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 csi-snapshot-controller 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 dns 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 etcd 4.15.0-0.nightly-2023-12-09-012410 True False False 5h38m
12-11 16:28:56.968 image-registry 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 ingress 4.15.0-0.nightly-2023-12-09-012410 True False False 108m
12-11 16:28:56.968 insights 4.15.0-0.nightly-2023-12-09-012410 True False False 5h33m
12-11 16:28:56.968 kube-apiserver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h35m
12-11 16:28:56.968 kube-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False True 5h37m GarbageCollectorDegraded: error querying alerts: Post "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query": dial tcp 172.30.77.136:9091: i/o timeout
12-11 16:28:56.968 kube-scheduler 4.15.0-0.nightly-2023-12-09-012410 True False False 5h37m
12-11 16:28:56.968 kube-storage-version-migrator 4.15.0-0.nightly-2023-12-09-012410 True False False 109m
12-11 16:28:56.968 machine-api 4.15.0-0.nightly-2023-12-09-012410 True False False 5h36m
12-11 16:28:56.968 machine-approver 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 machine-config 4.14.0-0.nightly-2023-12-08-072853 True False False 5h39m
12-11 16:28:56.968 marketplace 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 monitoring 4.15.0-0.nightly-2023-12-09-012410 False True True 63s UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io federate)
12-11 16:28:56.968 network 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 node-tuning 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 openshift-apiserver 4.15.0-0.nightly-2023-12-09-012410 False False False 97m APIServicesAvailable: "image.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request
12-11 16:28:56.968 openshift-controller-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h39m
12-11 16:28:56.968 openshift-samples 4.15.0-0.nightly-2023-12-09-012410 True False False 124m
12-11 16:28:56.968 operator-lifecycle-manager 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-09-012410 True False False 100m
12-11 16:28:56.968 service-ca 4.15.0-0.nightly-2023-12-09-012410 True False False 5h40m
12-11 16:28:56.968 storage 4.15.0-0.nightly-2023-12-09-012410 True False False 104m
2. From 4.14.5-x86_64 -> 4.15.0-0.nightly-2023-12-11-033133
% oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m APIServicesAvailable: "user.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request...
baremetal 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m cloud-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h47m cloud-credential 4.15.0-0.nightly-2023-12-11-033133 True False False 5h50m cluster-autoscaler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m config-operator 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m RouteHealthAvailable: console route is not admitted control-plane-machine-set 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m csi-snapshot-controller 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m dns 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m etcd 4.15.0-0.nightly-2023-12-11-033133 True False False 5h43m image-registry 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m ingress 4.15.0-0.nightly-2023-12-11-033133 True False False 4h22m insights 4.15.0-0.nightly-2023-12-11-033133 True False False 5h39m kube-apiserver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m GarbageCollectorDegraded: error fetching rules: Get "[https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules]": dial tcp 172.30.237.96:9091: i/o timeout kube-scheduler 4.15.0-0.nightly-2023-12-11-033133 True False False 5h42m kube-storage-version-migrator 4.15.0-0.nightly-2023-12-11-033133 True False False 3h34m machine-api 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m machine-approver 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m machine-config 4.14.5 True False False 5h44m marketplace 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m monitoring 4.15.0-0.nightly-2023-12-11-033133 False True True 4m32s UpdatingAlertmanager: reconciling Alertmanager Route failed: updating Route object failed: the server is currently unable to handle the request (put routes.route.openshift.io alertmanager-main), 
UpdatingUserWorkloadThanosRuler: reconciling Thanos Ruler Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-ruler), UpdatingThanosQuerier: reconciling Thanos Querier Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io thanos-querier), UpdatingPrometheus: reconciling Prometheus API Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io prometheus-k8s), UpdatingUserWorkloadPrometheus: reconciling UserWorkload federate Route failed: retrieving Route object failed: the server is currently unable to handle the request (get routes.route.openshift.io federate) network 4.15.0-0.nightly-2023-12-11-033133 True False False 5h44m node-tuning 4.15.0-0.nightly-2023-12-11-033133 True False False 3h48m openshift-apiserver 4.15.0-0.nightly-2023-12-11-033133 False False False 11m APIServicesAvailable: "apps.openshift.io.v1" is not ready: an attempt failed with statusCode = 503, err = the server is currently unable to handle the request... 
openshift-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h41m
openshift-samples 4.15.0-0.nightly-2023-12-11-033133 True False False 3h49m
operator-lifecycle-manager 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-catalog 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m
operator-lifecycle-manager-packageserver 4.15.0-0.nightly-2023-12-11-033133 True False False 2m57s
service-ca 4.15.0-0.nightly-2023-12-11-033133 True False False 5h46m
storage 4.15.0-0.nightly-2023-12-11-033133 True False False 3h28m
% oc get pods -n openshift-ovn-kubernetes
NAME READY STATUS RESTARTS AGE
ovn-ipsec-host-bn5mm 1/1 Running 0 3h17m
ovn-ipsec-host-dlg5c 1/1 Running 0 3h20m
ovn-ipsec-host-dztzf 1/1 Running 0 3h14m
ovn-ipsec-host-tfflr 1/1 Running 0 3h11m
ovn-ipsec-host-wvkwq 1/1 Running 0 3h10m
ovnkube-control-plane-85b45bf6cf-78tbq 2/2 Running 0 3h30m
ovnkube-control-plane-85b45bf6cf-n5pqn 2/2 Running 0 3h33m
ovnkube-node-4rwk4 8/8 Running 8 3h40m
ovnkube-node-567rz 8/8 Running 8 3h34m
ovnkube-node-c7hv4 8/8 Running 8 3h40m
ovnkube-node-qmw49 8/8 Running 8 3h35m
ovnkube-node-s2nsw 8/8 Running 0 3h36m
Multiple pods on different nodes have connection problems.
% oc get pods -n openshift-network-diagnostics -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
network-check-source-5cd74f77cc-mlqvz 1/1 Running 0 134m 10.131.0.25 huirwang-46126-g66cb-compute-0 <none> <none>
network-check-target-824mt 1/1 Running 1 139m 10.130.0.212 huirwang-46126-g66cb-control-plane-2 <none> <none>
network-check-target-dzl7m 1/1 Running 1 140m 10.128.2.46 huirwang-46126-g66cb-compute-1 <none> <none>
network-check-target-l224m 1/1 Running 1 133m 10.129.0.173 huirwang-46126-g66cb-control-plane-1 <none> <none>
network-check-target-qd48q 1/1 Running 1 138m 10.128.0.148 huirwang-46126-g66cb-control-plane-0 <none> <none>
network-check-target-sc8hn 1/1 Running 0 134m 10.131.0.3 huirwang-46126-g66cb-compute-0 <none> <none>
% oc rsh -n openshift-network-diagnostics network-check-source-5cd74f77cc-mlqvz
sh-5.1$ curl 10.130.0.212:8080 --connect-timeout 5
curl: (28) Connection timed out after 5000 milliseconds
sh-5.1$ curl 10.128.2.46:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.129.0.173:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.128.0.148:8080 --connect-timeout 5
curl: (28) Connection timed out after 5001 milliseconds
sh-5.1$ curl 10.131.0.3:8080 --connect-timeout 5
Hello, 10.131.0.25. You have reached 10.131.0.3 on huirwang-46126-g66cb-compute-0
sh-5.1$
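The cluster operator listings in these steps are long, so a small filter can help pick out the operators that are unavailable or degraded. The following is a minimal sketch, assuming each row has been split onto its own line in the `NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE` column order; the function name `unhealthy_operators` is hypothetical and not part of any OpenShift tooling.

```python
def unhealthy_operators(rows):
    """Return names of operators that are not Available or are Degraded.

    Each row is one line of `oc get co` output without the header.
    Rows with fewer than five columns are skipped.
    """
    bad = []
    for row in rows:
        fields = row.split()
        if len(fields) < 5:
            continue
        name, _version, available, _progressing, degraded = fields[:5]
        if available != "True" or degraded == "True":
            bad.append(name)
    return bad


# Sample rows copied from the second upgrade path in the report.
sample = [
    "authentication 4.15.0-0.nightly-2023-12-11-033133 False False True 3h32m",
    "baremetal 4.15.0-0.nightly-2023-12-11-033133 True False False 5h45m",
    "console 4.15.0-0.nightly-2023-12-11-033133 False False False 3h30m",
    "kube-controller-manager 4.15.0-0.nightly-2023-12-11-033133 True False True 5h42m",
]
print(unhealthy_operators(sample))
```

The sketch ignores the MESSAGE column, so it only shows which operators to look at, not why they are failing.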
Actual results:
Upgrade failed.
Expected results:
Upgrade succeeded.
Additional info:
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
# internal CI failure
# customer issue / SD
# internal RedHat testing failure
If it is an internal RedHat testing failure:
* Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).
If it is a CI failure:
* Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
* Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
* Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
* When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
* If it's a connectivity issue,
* What is the srcNode, srcIP and srcNamespace and srcPodName?
* What is the dstNode, dstIP and dstNamespace and dstPodName?
* What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
If it is a customer / SD issue:
* Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
* Don’t presume that Engineering has access to Salesforce.
* Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment. The format should be: [https://access.redhat.com/support/cases/#/case/]<case number>/discussion?attachmentId=<attachment id>
* Describe what each attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
* Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
** If the issue is in a customer namespace then provide a namespace inspect.
** If it is a connectivity issue:
*** What is the srcNode, srcNamespace, srcPodName and srcPodIP?
*** What is the dstNode, dstNamespace, dstPodName and dstPodIP?
*** What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
*** Please provide the UTC timestamp of the networking outage window from must-gather
*** Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
** If it is not a connectivity issue:
*** Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
* For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
* For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
* Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
Description of problem:
HostedClusters with a .status.controlPlaneEndpoint.port: 443 unexpectedly also expose the KAS on port 6443. This causes four security group rules to be consumed per LoadBalancer service (443/6443 for router and 443/6443 for private-router) instead of just two (443 for router and 443 for private-router). This directly impacts the number of HostedClusters on a Management Cluster since there is a hard cap of 200 security group rules per security group.
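The capacity impact can be worked out with simple arithmetic. This is an illustrative sketch only: it assumes that each HostedCluster contributes one rule per exposed port on each of its two services (router and private-router) and that no other rules share the security group. The function name is hypothetical.

```python
RULE_CAP = 200  # hard cap of rules per security group, as stated in the description


def max_hosted_clusters(ports_per_service, services_per_cluster=2, rule_cap=RULE_CAP):
    """Upper bound on HostedClusters before the rule cap is reached."""
    rules_per_cluster = ports_per_service * services_per_cluster
    return rule_cap // rules_per_cluster


print(max_hosted_clusters(2))  # 443 and 6443 exposed -> 50
print(max_hosted_clusters(1))  # only 443 exposed -> 100
```

Under these assumptions, exposing the extra port halves the number of HostedClusters that fit within one security group.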
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
100%
Steps to Reproduce:
1. Create a HostedCluster resulting in its .status.controlPlaneEndpoint.port: 443
2. Observe that the router/private-router LoadBalancer services expose both ports 6443 and 443
Actual results:
The router/private-router LoadBalancer services expose both ports 6443 and 443
Expected results:
The router/private-router LoadBalancer services expose only port 443
Additional info:
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/147
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/1002
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The CVO-managed manifests that CMO ships lack capability annotations as defined in https://github.com/openshift/enhancements/blob/master/enhancements/installer/component-selection.md#manifest-annotations.
The dashboards should be tied to the console capability so that when CMO deploys on a cluster without the Console capability, CVO doesn't deploy the dashboards configmap.
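The gating described above can be pictured as a filter over manifests. The sketch below is not CVO code: it assumes the `capability.openshift.io/name` annotation key from the linked enhancement, assumes `+` as the separator when a manifest names several capabilities, and uses a hypothetical ConfigMap name.

```python
CAPABILITY_ANNOTATION = "capability.openshift.io/name"


def should_deploy(manifest, enabled_capabilities):
    """Return True if every capability named on the manifest is enabled.

    Manifests without the annotation are always deployed.
    """
    annotations = manifest.get("metadata", {}).get("annotations", {})
    value = annotations.get(CAPABILITY_ANNOTATION, "")
    required = {name for name in value.split("+") if name}
    return required <= set(enabled_capabilities)


# Hypothetical dashboards ConfigMap tied to the Console capability.
dashboard = {
    "kind": "ConfigMap",
    "metadata": {
        "name": "example-dashboard",
        "annotations": {CAPABILITY_ANNOTATION: "Console"},
    },
}

print(should_deploy(dashboard, {"Console"}))  # True
print(should_deploy(dashboard, set()))        # False
```

With the annotation in place, a cluster installed without the Console capability would skip the dashboards manifest instead of deploying an unused ConfigMap.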
Issue 57 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Open the "Add Helm Chart Repositories to extend the Developer Catalog for your project" quick start. Go to the next step. You will see a code sample that does not have the right style if you've enabled dark theme.
Note: Could we check whether we can also update the PatternFly quickstart extension?
Screenshot: https://drive.google.com/file/d/1hxh5VI2S7jLKRdNlDQsdlAXL_G7TxtME/view?usp=sharing
Please review the following PR: https://github.com/openshift/node_exporter/pull/131
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/network-tools/pull/87
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install the Pipeline operator and set up tekton-results on the cluster
2. Create a PAC repository and trigger a PLR
3. Open the network tab and visit the Repository list page
Actual results:
The internet API call is made in an infinite loop.
Expected results:
The internet API call should not be made continuously.
Additional info:
Description of problem:
When the installer gathers a log bundle after failure (either automatically or with gather bootstrap), the installer fails to return serial console logs if an SSH connection to the bootstrap node is refused. Even if the serial console logs were collected, the installer exits on error if the SSH connection is refused:
time="2024-03-09T20:59:26Z" level=info msg="Pulling VM console logs"
time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"kubernetes.io/cluster/ci-op-4ygffz3q-be93e-jnn92\":\"owned\"}"
time="2024-03-09T20:59:26Z" level=debug msg="Search for matching instances by tag in us-west-1 matching aws.Filter{\"openshiftClusterID\":\"2f9d8822-46fd-4fcd-9462-90c766c3d158\"}"
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-bootstrap" Instance=i-0413f793ffabe9339
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0413f793ffabe9339
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-0" Instance=i-0ab5f920818366bb8
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0ab5f920818366bb8
time="2024-03-09T20:59:27Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-2" Instance=i-0b93963476818535d
time="2024-03-09T20:59:27Z" level=debug msg="Download complete" Instance=i-0b93963476818535d
time="2024-03-09T20:59:28Z" level=debug msg="Attemping to download console logs for ci-op-4ygffz3q-be93e-jnn92-master-1" Instance=i-0797728e092bfbeef
time="2024-03-09T20:59:28Z" level=debug msg="Download complete" Instance=i-0797728e092bfbeef
time="2024-03-09T20:59:28Z" level=info msg="Pulling debug logs from the bootstrap machine"
time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/bootstrap-ssh3643557583 to installer's internal agent"
time="2024-03-09T20:59:28Z" level=debug msg="Added /tmp/.ssh/ssh-privatekey to installer's internal agent"
time="2024-03-09T21:01:39Z" level=error msg="Attempted to gather debug logs after installation failure: failed to connect to the bootstrap machine: dial tcp 13.57.212.80:22: connect: connection timed out"
from: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_api/1788/pull-ci-openshift-api-master-e2e-aws-ovn/1766560949898055680
The log shows that the console logs were downloaded, so they should be saved in the log bundle.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run an install that fails and where SSH to the bootstrap node also fails. https://github.com/openshift/installer/pull/8137 provides a potential reproducer.
Actual results:
Expected results:
Additional info:
Error handling needs to be reworked here: https://github.com/openshift/installer/blob/master/cmd/openshift-install/gather.go#L160-L190
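One way to picture the requested behavior is a gather loop that keeps whatever each step collected and reports failures at the end, instead of exiting on the first error. This is a language-neutral sketch written in Python, not the installer's Go code; the step functions and file names are hypothetical stand-ins.

```python
def gather_bundle(steps):
    """Run every (name, func) step and keep results from the ones that succeed.

    Returns (collected, errors) so the caller can still write a partial
    bundle when some steps fail.
    """
    collected = {}
    errors = []
    for name, func in steps:
        try:
            collected[name] = func()
        except Exception as exc:  # a failed step must not discard earlier results
            errors.append(f"{name}: {exc}")
    return collected, errors


def pull_console_logs():
    # Hypothetical stand-in for the serial console download.
    return ["bootstrap-console.log", "master-0-console.log"]


def pull_bootstrap_ssh_logs():
    # Hypothetical stand-in for the SSH gather that fails in this bug.
    raise ConnectionError("failed to connect to the bootstrap machine")


collected, errors = gather_bundle([
    ("console", pull_console_logs),
    ("bootstrap-ssh", pull_bootstrap_ssh_logs),
])
print(sorted(collected))  # ['console']
print(len(errors))        # 1
```

The design choice is to treat the SSH failure as a reported error rather than a fatal one, so the console logs that were already downloaded still reach the bundle.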
Description of problem:
From the logs of the ovnkube-node pod in CrashLoopBackOff (CLBO):
Upgrade hack: Timed out waiting for the remote ovnkube-controller to be ready even after 5 minutes, err : context deadline exceeded, unable to fetch node-subnet annotation for node ip-10-0-133-201.us-east-2.compute.internal: err, could not find "k8s.ovn.org/node-subnets" annotation
ovnkube-controller not being ready implies the absence of the node-subnets annotation.
The CNO upgrade is stuck at:
DaemonSet "/openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2023-08-30T10:06:44Z
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Install an OCP cluster with RHCOS and Windows nodes on 4.13
2. Perform an upgrade to 4.14
Actual results:
Upgrades failed on CNO
Expected results:
Upgrade should pass
Additional info:
must-gather: http://shell.lab.bos.redhat.com/~anusaxen/must-gather.local.1473221474492991466/
Update package versions in the ironic-agent container to bring in the latest fixes.
This came out of https://bugzilla.redhat.com/show_bug.cgi?id=1943704.
Add a dashboard for iowait CPU on master nodes. This will help customers, customer support, and us identify problems that result in leader elections, which we often see caused by high iowait that coincides with large spikes in fsync and/or peer-to-peer latency.
Query:
(sum(irate(node_cpu_seconds_total {mode="iowait"} [2m])) without (cpu)) / count(node_cpu_seconds_total) without (cpu) * 100 AND on (instance) label_replace( kube_node_role{role="master"}, "instance", "$1", "node", "(.+)" )
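To make the intent of the query concrete, the sketch below computes an iowait percentage from two consecutive samples of the cumulative per-CPU iowait counter, which is roughly what `irate` does over the last two points in the range. It is a simplified illustration, not a re-implementation of the query: it divides by the number of CPUs, which is an assumption about the intended denominator, and the function name and sample values are made up.

```python
def iowait_percent(prev, last, interval_seconds):
    """Average iowait percentage across CPUs between two samples.

    prev and last map a CPU id to cumulative iowait seconds
    (the node_cpu_seconds_total counter with mode="iowait").
    """
    rates = [(last[cpu] - prev[cpu]) / interval_seconds for cpu in last]
    return sum(rates) / len(rates) * 100


# Made-up samples taken 10 seconds apart on a two-CPU node.
prev = {"0": 100.0, "1": 200.0}
last = {"0": 102.5, "1": 200.0}
print(iowait_percent(prev, last, 10))
```

In this example CPU 0 spent a quarter of the interval waiting on I/O and CPU 1 spent none, so the node-level value is the average of the two.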
Description of problem:
A net-attach-def using "type: ovn-k8s-cni-overlay, topology:layer2" does not work in a hosted pod when using the Kubevirt provider.
Note: As a general hosted multus sanity check, using a "type: bridge" NAD does work properly in a hosted pod and both interfaces start as expected:
Normal AddedInterface 86s multus Add eth0 [10.133.0.21/23] from ovn-kubernetes
Normal AddedInterface 86s multus Add net1 [192.0.2.193/27] from default/bridge-net
Version-Release number of selected component (if applicable):
OCP 4.14.1 CNV 4.14.0-2385
How reproducible:
Reproduced w/ multiple attempts when using OVN secondary network
Steps to Reproduce:
1. Create the NAD on the hosted Kubevirt cluster:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: l2-network
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "l2-network",
      "type": "ovn-k8s-cni-overlay",
      "topology":"layer2",
      "netAttachDefName": "default/l2-network"
    }
2. Create a hosted pod w/ that net annotation:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: '[ { "name": "l2-network", "interface": "net1", "ips": [ "192.0.2.22/24" ] } ]'
  name: debug-ovnl2-c
  namespace: default
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
    runAsNonRoot: true
    runAsUser: 1000
  containers:
  - name: debug-ovnl2-c
    command:
    - /usr/bin/bash
    - -x
    - -c
    - |
      sleep infinity
    image: quay.io/cloud-bulldozer/uperf:latest
    imagePullPolicy: Always
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
  nodeSelector:
    kubernetes.io/hostname: kv1-a8a5d7f1-9xwm4
3. Pod remains in ContainerCreating because it cannot create the net1 iface, pod describe event logs:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m21s default-scheduler Successfully assigned default/debug-ovnl2-c to kv1-a8a5d7f1-9xwm4
Warning FailedCreatePodSandBox 2m20s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_debug-ovnl2-c_default_1b42bc5a-1148-49d8-a2d0-7689a46f59ea_0(1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73): error adding pod default_debug-ovnl2-c to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73 Netns:/var/run/netns/5da048e3-b534-481d-acc6-2ddc6a439586 IfName:eth0
Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=debug-ovnl2-c;K8S_POD_INFRA_CONTAINER_ID=1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73;K8S_POD_UID=1b42bc5a-1148-49d8-a2d0-7689a46f59ea Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 
97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73" Netns:"/var/run/netns/5da048e3-b534-481d-acc6-2ddc6a439586" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=debug-ovnl2-c;K8S_POD_INFRA_CONTAINER_ID=1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73;K8S_POD_UID=1b42bc5a-1148-49d8-a2d0-7689a46f59ea" Path:"" ERRORED: error configuring pod [default/debug-ovnl2-c] networking: [default/debug-ovnl2-c/1b42bc5a-1148-49d8-a2d0-7689a46f59ea:l2-network]: error adding container to network "l2-network": CNI request failed with status 400: '[default/debug-ovnl2-c 1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73 network l2-network NAD default/l2-network] [default/debug-ovnl2-c 1e2d9008074c3c5af5ccbb2e7e2e7ca2466395b642a1677db2dfadd35eb84b73 network l2-network NAD default/l2-network] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded ' ' Warning FailedCreatePodSandBox 19s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_debug-ovnl2-c_default_1b42bc5a-1148-49d8-a2d0-7689a46f59ea_0(48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb): error adding pod default_debug-ovnl2-c to CNI network "multus-cni-network": plugin 
type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb Netns:/var/run/netns/cae8fab7-80c2-40b7-b1a7-49c8fc8732b2 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=debug-ovnl2-c;K8S_POD_INFRA_CONTAINER_ID=48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb;K8S_POD_UID=1b42bc5a-1148-49d8-a2d0-7689a46f59ea Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 
44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb" Netns:"/var/run/netns/cae8fab7-80c2-40b7-b1a7-49c8fc8732b2" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=debug-ovnl2-c;K8S_POD_INFRA_CONTAINER_ID=48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb;K8S_POD_UID=1b42bc5a-1148-49d8-a2d0-7689 a46f59ea" Path:"" ERRORED: error configuring pod [default/debug-ovnl2-c] networking: [default/debug-ovnl2-c/1b42bc5a-1148-49d8-a2d0-7689a46f59ea:l2-network]: error adding container to network "l2-network": CNI request failed with status 400: '[default/debug-ovnl2-c 48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb network l2-network NAD default/l2-network] [default/debug-ovnl2-c 48110f0ecc0979992108e4441ff06f50c0d90f527cbe0b8fe1ca18d5398b67eb network l2-network NAD default/l2-network] failed to get pod annotation: timed out waiting for annotations: context deadline exceeded ' ' Normal AddedInterface 18s (x3 over 4m20s) multus Add eth0 [10.133.0.21/23] from ovn-kubernetes
Actual results:
Pod cannot start
Expected results:
Pod can start with additional "ovn-k8s-cni-overlay" network
Additional info:
Slack thread: https://redhat-internal.slack.com/archives/C02UVQRJG83/p1698857051578159 I did confirm the same NAD and pod definition start fine on the management cluster.
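The StdinData field in the CNI error above is a Go byte slice printed as decimal values, which hides the multus configuration it carries. A minimal Python sketch for turning such a dump back into JSON follows; the function name decode_stdin_data is hypothetical, and the sample is only the first bytes of the dump with a closing brace (125) appended so that the fragment parses on its own.

```python
import json


def decode_stdin_data(dump: str) -> dict:
    """Turn a Go-style byte dump such as '[123 34 ... 125]' into a dict."""
    # Drop the surrounding brackets, then read each token as one byte value.
    numbers = [int(token) for token in dump.strip("[] \n").split()]
    return json.loads(bytes(numbers).decode("utf-8"))


# First bytes of the dump above, closed with '}' (125) for illustration only.
sample = (
    "[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 "
    "47 99 110 105 47 98 105 110 34 125]"
)
print(decode_stdin_data(sample))
```

Pasting the complete byte list from the log in place of the sample should yield the full multus-shim configuration.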
In our CI, pre-submit jobs for IBM VPC CSI driver and its operator are failing with:
[sig-arch] events should not repeat pathologically for ns/openshift-cluster-csi-drivers { 16 events happened too frequently event happened 25 times, something is wrong: ns/openshift-cluster-csi-drivers pod/ibm-vpc-block-csi-node-vck82 node/ci-op-jsqf19qs-00b5a-mjg8w-master-1 hmsg/99d84ba4c3 - pathological/true reason/FailedToRetrieveImagePullSecret Unable to retrieve some image pull secrets (bluemix-default-secret, bluemix-default-secret-regional, bluemix-default-secret-international, icr-io-secret); attempting to pull the image may not succeed. From: 06:44:57Z To: 06:44:58Z result=reject
Example:
Operator CI:
Driver CI:
The driver itself appears to work, so this is probably just a transient but annoying error.
Please review the following PR: https://github.com/openshift/baremetal-runtimecfg/pull/289
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/115
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-etcd-operator/pull/1169
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
backport of https://issues.redhat.com//browse/OCPBUGS-27211
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Issue 52 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Resource YAML view: clicking "Show tooltips" crashes the current page.
Screenshot: https://drive.google.com/file/d/1lT3mUAPIm0ba5tNVDW3Ztz6Hgj4D1DFz/view?usp=drive_link
Description of problem:
For more details, see https://issues.redhat.com/browse/OCPBUGS-18702?focusedId=23021716&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-23021716. A recording of the failure ("sc_form_typeerror.mp4") is available at https://drive.google.com/drive/folders/14aSJs-lO6HC-2xYFlOTJtCZIQg3ekE85?usp=sharing.
Issues: 1. The TypeError referenced above. 2. Default params added by an extension are not added to the created StorageClass. 3. Validation for parameters added by an extension is not working correctly either. 4. The Provisioner child details get stuck once the user selects 'openshift-storage.cephfs.csi.ceph.com'.
Version-Release number of selected component (if applicable):
4.14 (OCP)
How reproducible:
Steps to Reproduce:
1. Install the ODF operator. 2. Create a StorageSystem (once the dynamic plugin is loaded). 3. Wait for the ODF-related StorageClasses to be created. 4. Once they are created, go to the "Create StorageSystem" form. 5. Switch to the provisioners (rbd.csi.ceph) added by the ODF dynamic plugin.
Actual results:
Page breaks with an error.
Expected results:
The page should not break, and the form should behave as it did before the refactoring introduced by PR https://github.com/openshift/console/pull/13036
Additional info:
Stack trace: Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'parameters') at allRequiredFieldsFilled (storage-class-form.tsx:204:1) at validateForm (storage-class-form.tsx:235:1) at storage-class-form.tsx:262:1 at invokePassiveEffectCreate (react-dom.development.js:23487:1) at HTMLUnknownElement.callCallback (react-dom.development.js:3945:1) at Object.invokeGuardedCallbackDev (react-dom.development.js:3994:1) at invokeGuardedCallback (react-dom.development.js:4056:1) at flushPassiveEffectsImpl (react-dom.development.js:23574:1) at unstable_runWithPriority (scheduler.development.js:646:1) at runWithPriority$1 (react-dom.development.js:11276:1) {componentStack: '\n at StorageClassFormInner (http://localhost:90...c03030668ef271da51f.js:491534:20)\n at Suspense'}
Description of problem:
There is a problem with the logic change in https://github.com/openshift/machine-config-operator/pull/4196 that is causing Kubelet to fail to start after a reboot on OpenShiftSDN deployments. This is currently breaking all of the v4 metal jobs.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Deploy a baremetal cluster with OpenShiftSDN.
Actual results:
Nodes fail to join cluster
Expected results:
Successful cluster deployment
Additional info:
This is a clone of issue OCPBUGS-27323. The following is the description of the original issue:
—
Description of problem:
The following test case fails continuously in 4.14 to 4.15 and 4.15 to 4.16 upgrade CI runs: [bz-Image Registry] clusteroperator/image-registry should not change condition/Available
4.14 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.14.0-0.nightly-ppc64le-2024-01-15-085349
4.15 Image: registry.ci.openshift.org/ocp-ppc64le/release-ppc64le:4.15.0-0.nightly-ppc64le-2024-01-15-042536
Please review the following PR: https://github.com/openshift/csi-external-resizer/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The following pre-submit jobs for Local Zones have been permanently failing since August:
- e2e-aws-ovn-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones?buildId=1716457254460329984
- e2e-aws-ovn-shared-vpc-localzones: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-installer-master-e2e-aws-ovn-shared-vpc-localzones
Investigating, we can see common failures in the tests '[sig-network] can collect <poller_name> poller pod logs', which prevent most of the jobs from completing correctly. Exploring the code, I can see that the test was added recently, around August, which matches when the failures started. Pods must tolerate the taint "node-role.kubernetes.io/edge" to run on instances located in Local Zones ("edge nodes"). I am not sure if I am looking in the correct place, but the deployment seems to tolerate only the master taint: https://github.com/openshift/origin/blob/master/pkg/monitortests/network/disruptionpodnetwork/host-network-target-deployment.yaml#L42
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
always
Steps to Reproduce:
Trigger the job: 1. Open a PR on installer. 2. Run the job. 3. Check the failed tests '[sig-network] can collect <poller_name> poller pod logs'. Example of a blocked 4.15 feature PR (Wavelength Zones): https://github.com/openshift/installer/pull/7369#issuecomment-1783699175
Actual results:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/7590/pull-ci-openshift-installer-master-e2e-aws-ovn-localzones/1715075142427611136
{ 1 pods lacked sampler output: [pod-network-to-pod-network-disruption-poller-d94fb55db-9qfpz]}
E1018 22:06:34.773866 1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
E1018 22:06:34.774669 1 disruption_backend_sampler.go:496] not finished writing all samples (1 remaining), but we're told to close
Expected results:
Should the monitor jobs be scheduled on edge nodes? How can we track job failures for new monitor tests?
Additional info:
Edge nodes have NoSchedule taints applied by default; to run monitor pods on those nodes, the pods must tolerate the taint "node-role.kubernetes.io/edge". See the enhancement for more information: https://github.com/openshift/enhancements/blob/master/enhancements/installer/aws-custom-edge-machineset-local-zones.md#user-workload-deployments
Looking at the must-gather of job 1716457254460329984, you can see that the monitor pods were not scheduled because of the missing tolerations:
$ grep -rni pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2 \
  1716457254460329984-must-gather/09abb0d6fc08ee340563e6e11f5ceafb42fb371e50ab6acee6764031062525b7/namespaces/openshift-kube-scheduler/pods/ \
  | awk -F'] "' '{print$2}' | sort | uniq -c
215 Unable to schedule pod; no fit; waiting" pod="e2e-pod-network-disruption-test-59s5d/pod-network-to-pod-network-disruption-poller-7c97cd5d7-t2mn2" err="0/7 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/edge: }, 6 node(s) didn't match pod anti-affinity rules. preemption: 0/7 nodes are available: 1 Preemption is not helpful for scheduling, 6 No preemption victims found for incoming pod.."
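The scheduler message above comes down to the standard Kubernetes rule for matching tolerations against taints. A minimal Python sketch of that rule follows, assuming plain dictionaries in place of API objects; the function names and the example toleration lists are hypothetical and are not copied from the origin repository.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # An empty effect on the toleration matches every taint effect.
    if toleration.get("effect") and toleration["effect"] != taint.get("effect"):
        return False
    # An empty key (used with operator Exists) matches every taint key.
    if toleration.get("key") and toleration["key"] != taint.get("key"):
        return False
    # Exists ignores the value; Equal (the default) compares values.
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def untolerated_taints(tolerations: list, taints: list) -> list:
    """List the scheduling-relevant taints that no toleration covers."""
    return [
        taint
        for taint in taints
        if taint.get("effect") in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, taint) for tol in tolerations)
    ]


edge_taint = {"key": "node-role.kubernetes.io/edge", "value": "", "effect": "NoSchedule"}
# Hypothetical toleration list that only covers control plane nodes.
master_only = [{"key": "node-role.kubernetes.io/master", "operator": "Exists", "effect": "NoSchedule"}]
# The same list with a toleration for the edge taint added.
with_edge = master_only + [{"key": "node-role.kubernetes.io/edge", "operator": "Exists", "effect": "NoSchedule"}]

print(untolerated_taints(master_only, [edge_taint]))
print(untolerated_taints(with_edge, [edge_taint]))
```

With only the master toleration the edge taint is reported as untolerated, which matches the "untolerated taint {node-role.kubernetes.io/edge: }" message; with the extra toleration the list is empty.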
Please review the following PR: https://github.com/openshift/platform-operators/pull/104
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Since the singular variant of APIVIP/IngressVIP has been removed as part of https://github.com/openshift/installer/pull/7574, the appliance disk image e2e job is now failing: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static. The job fails because the appliance supports only 4.14, which still requires the singular variant of the VIP properties.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Invoke appliance e2e job on master: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-appliance-master-e2e-compact-ipv4-static
Actual results:
The job fails with the following validation error, caused by the missing apiVIP and ingressVIP in AgentClusterInstall: "the Machine Network CIDR can be defined by setting either the API or Ingress virtual IPs"
Expected results:
AgentClusterInstall should also include the singular 'apiVIP' and 'ingressVIP', and the e2e job should complete successfully
Additional info:
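One way to picture the expected result is a small compatibility step that mirrors the first plural VIP into the singular field. The Python sketch below is only an illustration under assumptions: the spec is modelled as a plain dictionary, the function name add_singular_vips is hypothetical, the IP addresses are made up, and the actual fix in the installer may look different.

```python
def add_singular_vips(spec: dict) -> dict:
    """Return a copy of spec where apiVIP/ingressVIP mirror the first plural entry."""
    result = dict(spec)
    for singular, plural in (("apiVIP", "apiVIPs"), ("ingressVIP", "ingressVIPs")):
        vips = result.get(plural) or []
        # Only fill the singular field when it is absent and a plural value exists.
        if vips and not result.get(singular):
            result[singular] = vips[0]
    return result


# Made-up addresses for illustration.
spec = {"apiVIPs": ["192.168.111.5"], "ingressVIPs": ["192.168.111.4"]}
print(add_singular_vips(spec))
```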
This is a clone of issue OCPBUGS-28744. The following is the description of the original issue:
—
Description of problem:
$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.15.0-rc.4: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager
Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Channel: candidate-4.15 (available channels: candidate-4.15, candidate-4.16)
No updates available. You may still upgrade to a specific release image with --to-image or wait for new updates to be available.
$ oc get pods -n openshift-operator-lifecycle-manager
NAME                                      READY   STATUS             RESTARTS        AGE
catalog-operator-db86b7466-gdp4g          1/1     Running            0               9h
collect-profiles-28443465-9zzbk           0/1     Completed          0               34m
collect-profiles-28443480-kkgtk           0/1     Completed          0               19m
collect-profiles-28443495-shvs7           0/1     Completed          0               4m10s
olm-operator-56cb759d88-q2gr7             0/1     CrashLoopBackOff   8 (3m27s ago)   20m
package-server-manager-7cf46947f6-sgnlk   2/2     Running            0               9h
packageserver-7b795b79f-thxfw             1/1     Running            1               14d
packageserver-7b795b79f-w49jj             1/1     Running            0               4d17h
Version-Release number of selected component (if applicable):
How reproducible:
Unknown
Steps to Reproduce:
Upgrade from 4.15.0-rc.2 to 4.15.0-rc.4
Actual results:
The upgrade is unable to proceed
Expected results:
The upgrade can proceed
Additional info:
Description of problem:
The installer supports pre-rendering of the PerformanceProfile related manifests. However, the MCO render is executed after the PerformanceProfile render, so the master and worker MachineConfigPools are created too late. This causes the installation process to fail with:
Oct 18 18:05:25 localhost.localdomain bootkube.sh[537963]: I1018 18:05:25.968719 1 render.go:73] Rendering files into: /assets/node-tuning-bootstrap
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.008421 1 render.go:133] skipping "/assets/manifests/99_feature-gate.yaml" [1] manifest because of unhandled *v1.FeatureGate
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.013043 1 render.go:133] skipping "/assets/manifests/cluster-dns-02-config.yml" [1] manifest because of unhandled *v1.DNS
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.021978 1 render.go:133] skipping "/assets/manifests/cluster-ingress-02-config.yml" [1] manifest because of unhandled *v1.Ingress
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023016 1 render.go:133] skipping "/assets/manifests/cluster-network-02-config.yml" [1] manifest because of unhandled *v1.Network
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023160 1 render.go:133] skipping "/assets/manifests/cluster-proxy-01-config.yaml" [1] manifest because of unhandled *v1.Proxy
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.023445 1 render.go:133] skipping "/assets/manifests/cluster-scheduler-02-config.yml" [1] manifest because of unhandled *v1.Scheduler
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: I1018 18:05:26.024475 1 render.go:133] skipping "/assets/manifests/cvo-overrides.yaml" [1] manifest because of unhandled *v1.ClusterVersion
Oct 18 18:05:26 localhost.localdomain bootkube.sh[537963]: F1018 18:05:26.037467 1 cmd.go:53] no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
Always
Steps to Reproduce:
1. Add an SNO PerformanceProfile to the extra manifests in the installer. The node selector should be "node-role.kubernetes.io/master=".
Actual results:
no MCP found that matches performance profile node selector "node-role.kubernetes.io/master="
Expected results:
Installation completes
Additional info:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: openshift-node-workload-partitioning-sno
spec:
  cpu:
    isolated: 4-X  # must match the topology of the node
    reserved: 0-3
  nodeSelector:
    node-role.kubernetes.io/master: ""
Description of problem:
'Oh no! Something went wrong' is shown when the user goes to MultiClusterEngine details -> YAML tab
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. Install 'multicluster engine for Kubernetes' operator in the cluster 2. Use the default value to create a new MultiClusterEngine 3. Navigate to the MultiClusterEngine details -> Yaml Tab
Actual results:
The 'Oh no! Something went wrong.' error is shown with the following details: TypeError Description: Cannot read properties of null (reading 'editor')
Expected results:
no error
Additional info:
This bug fix is in conjunction with https://issues.redhat.com/browse/OCPBUGS-22778
We should warn loudly in the logs when customers change the managementState of a CSI operator, rather than logging it with lower-level log messages.
I spent a non-trivial amount of time debugging a cluster where the CSI driver would not get installed, only to find out that the customer had somehow set managementState to Removed.
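As an illustration of the request, the sketch below picks the log level from the managementState value, so that anything other than Managed stands out in the logs. It is a hedged Python sketch using the standard logging module; the logger name and the function log_management_state are hypothetical, and the real operators are written in Go.

```python
import logging

logger = logging.getLogger("csi-operator-sketch")  # hypothetical logger name


def log_management_state(state: str) -> int:
    """Log the managementState and return the level that was used."""
    # Managed is the normal case; Unmanaged and Removed deserve a loud warning.
    level = logging.INFO if state == "Managed" else logging.WARNING
    logger.log(level, "CSI operator managementState is %s", state)
    return level


log_management_state("Removed")
```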
Description of problem:
A stack trace is output when creating a hosted cluster via the hypershift CLI
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run hypershift create cluster aws ... to create a hosted cluster
Actual results:
The output will contain:
[controller-runtime] log.SetLogger(...) was never called; logs will not be displayed.
Detected at:
> goroutine 1 [running]:
> runtime/debug.Stack()
> /opt/homebrew/Cellar/go/1.21.4/libexec/src/runtime/debug/stack.go:24 +0x64
> sigs.k8s.io/controller-runtime/pkg/log.eventuallyFulfillRoot()
> /Users/xinjiang/Codes/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/log/log.go:60 +0xa0
> sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).WithName(0x14000845480, {0x10321d605, 0x14})
> /Users/xinjiang/Codes/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:147 +0x34
> github.com/go-logr/logr.Logger.WithName({{0x10490a710, 0x14000845480}, 0x0}, {0x10321d605, 0x14})
> /Users/xinjiang/Codes/hypershift/vendor/github.com/go-logr/logr/logr.go:336 +0x5c
> sigs.k8s.io/controller-runtime/pkg/client.newClient(0x1400097a900, {0x0, 0x140004a42a0, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
> /Users/xinjiang/Codes/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:122 +0xf8
> sigs.k8s.io/controller-runtime/pkg/client.New(0x14000ef98c0, {0x0, 0x140004a42a0, {0x0, 0x0}, 0x0, {0x0, 0x0}, 0x0})
> /Users/xinjiang/Codes/hypershift/vendor/sigs.k8s.io/controller-runtime/pkg/client/client.go:103 +0x78
> github.com/openshift/hypershift/cmd/util.GetClient()
> /Users/xinjiang/Codes/hypershift/cmd/util/client.go:50 +0x4f4
> github.com/openshift/hypershift/cmd/cluster/core.apply({0x104906d88, 0x140008dfb80}, {{0x10490acf8, 0x140009dd410}, 0x0}, 0x14000ef9200, 0x0, 0x0)
> /Users/xinjiang/Codes/hypershift/cmd/cluster/core/create.go:324 +0xc8
> github.com/openshift/hypershift/cmd/cluster/core.CreateCluster({0x104906d88, 0x140008dfb80}, 0x1400056f600, 0x1048c4360)
> /Users/xinjiang/Codes/hypershift/cmd/cluster/core/create.go:461 +0x264
> github.com/openshift/hypershift/cmd/cluster/aws.CreateCluster({0x104906d88, 0x140008dfb80}, 0x1400056f600)
> /Users/xinjiang/Codes/hypershift/cmd/cluster/aws/create.go:79 +0x78
> github.com/openshift/hypershift/cmd/cluster/aws.NewCreateCommand.func1(0x14000d1ac00, {0x14000a6cf70, 0x0, 0xd})
> /Users/xinjiang/Codes/hypershift/cmd/cluster/aws/create.go:65 +0x148
> github.com/spf13/cobra.(*Command).execute(0x14000d1ac00, {0x1400014c040, 0xd, 0xe})
> /Users/xinjiang/Codes/hypershift/vendor/github.com/spf13/cobra/command.go:940 +0x90c
> github.com/spf13/cobra.(*Command).ExecuteC(0x14000c91800)
> /Users/xinjiang/Codes/hypershift/vendor/github.com/spf13/cobra/command.go:1068 +0x770
> github.com/spf13/cobra.(*Command).Execute(0x14000c91800)
> /Users/xinjiang/Codes/hypershift/vendor/github.com/spf13/cobra/command.go:992 +0x30
> github.com/spf13/cobra.(*Command).ExecuteContext(0x14000c91800, {0x104906d88, 0x140008dfb80})
> /Users/xinjiang/Codes/hypershift/vendor/github.com/spf13/cobra/command.go:985 +0x70
> main.main()
> /Users/xinjiang/Codes/hypershift/main.go:70 +0x46c
2023-11-15T18:24:26+08:00 INFO Applied Kube resource {"kind": "Namespace", "namespace": "", "name": "clusters"}
Expected results:
No stack trace is output
Additional info:
Functionality is not affected; the cluster is still created.
In 4.15, when the agent installer is run from the openshift-baremetal-installer binary with an install-config containing platform data, it attempts to contact libvirt to validate the provisioning network interfaces for the bootstrap VM. This should never happen, as the agent installer doesn't use the bootstrap VM.
It is possible that users in the process of converting from baremetal IPI to the agent installer might run into this issue, since they would already be using the openshift-baremetal-installer binary.
Please review the following PR: https://github.com/openshift/cloud-provider-vsphere/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This fix contains the following changes coming from the updated version of Kubernetes, up to v1.28.7:
Changelog:
v1.28.7: https://github.com/kubernetes/kubernetes/blob/release-1.28/CHANGELOG/CHANGELOG-1.28.md#changelog-since-v1286
Description of problem:
This is an issue that IBM Cloud found, and it likely affects Power VS. See https://issues.redhat.com/browse/OCPBUGS-28870. When a private cluster is installed with a base domain in install-config.yaml that is the same as another existing CIS domain name, the DNS resource records remain after the private cluster is destroyed.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Create a DNS service instance, setting its domain to "ibmcloud.qe.devcluster.openshift.com". Note that this domain name is also used by another existing CIS domain. 2. Install a private ibmcloud cluster with the base domain in install-config set to "ibmcloud.qe.devcluster.openshift.com". 3. Destroy the cluster. 4. Check for remaining DNS records.
Actual results:
$ ibmcloud dns resource-records 5f8a0c4d-46c2-4daa-9157-97cb9ad9033a -i preserved-openshift-qe-private | grep ci-op-17qygd06-23ac4
api-int.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
*.apps.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
api.ci-op-17qygd06-23ac4.ibmcloud.qe.devcluster.openshift.com
Expected results:
No DNS records for the cluster remain
Additional info:
$ ibmcloud dns zones -i preserved-openshift-qe-private | awk '{print $2}'
Name
private-ibmcloud.qe.devcluster.openshift.com
private-ibmcloud-1.qe.devcluster.openshift.com
ibmcloud.qe.devcluster.openshift.com
$ ibmcloud cis domains
Name
ibmcloud.qe.devcluster.openshift.com
When private-ibmcloud.qe.devcluster.openshift.com or private-ibmcloud-1.qe.devcluster.openshift.com is used as the domain, the issue does not occur; when ibmcloud.qe.devcluster.openshift.com is used as the domain, the DNS records remain.
It was renamed between ec.1 and ec.2:
$ oc adm release extract --to ec.1 quay.io/openshift-release-dev/ocp-release:4.15.0-ec.1-x86_64
$ oc adm release extract --to ec.2 quay.io/openshift-release-dev/ocp-release:4.15.0-ec.2-x86_64
$ yaml2json <ec.1/0000_30_cluster-api_10_webhooks.yaml | jq -r .metadata.name
validating-webhook-configuration
$ yaml2json <ec.2/0000_30_cluster-api_10_webhooks.yaml | jq -r .metadata.name
cluster-capi-operator
And the presence of the old config breaks updates across the gap, as the operator tries to act on resources that are still guarded by a webhook config, despite there no longer being anything serving the hooks it had pointed at. Or something like that. In any case, the cluster-api ClusterOperator goes Degraded=True on SyncingFailed with {{Failed to resync for operator: 4.15.0-ec.2 because &{%!e(string=unable to reconcile CoreProvider: unable to create or update CoreProvider: Internal error occurred: failed calling webhook "vcoreprovider.operator.cluster.x-k8s.io": failed to call webhook: the server could not find the requested resource)}}} until the old ValidatingWebhookConfiguration is deleted, and after that deletion, the ClusterOperator recovers.
4.15.0-ec.2.
Untested, but I'd guess 100%.
1. Install a tech-preview 4.15.0-ec.1 cluster.
2. Request an update to 4.15.0-ec.2.
3. Wait an hour or so.
cluster-api ClusterOperator is Degraded=True, blocking further progress in the ClusterVersion update.
ClusterVersion update happily completes.
This is a clone of issue OCPBUGS-26940. The following is the description of the original issue:
—
Description of problem:
If OLMPlacement is set to management and the cluster is brought up with disableAllDefaultSources set to true, removing the setting from the HostedCluster CR does not remove it in the guest cluster, where disableAllDefaultSources is still set to true
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The secret/vmware-vsphere-cloud-credentials in ns/openshift-cluster-csi-drivers is not synced correctly when updating secret/vsphere-creds in ns/kube-system
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-10-084534
How reproducible:
Always
Steps to Reproduce:
$ oc -n kube-system get secret vsphere-creds -o yaml apiVersion: v1 data: vcenter.devqe.ibmc.devcluster.openshift.com.password: xxx vcenter.devqe.ibmc.devcluster.openshift.com.username: xxx kind: Secret metadata: annotations: cloudcredential.openshift.io/mode: passthrough ...
Same for the secret/vmware-vsphere-cloud-credentials in ns/openshift-cluster-csi-drivers
$ oc -n openshift-cluster-csi-drivers get secret vmware-vsphere-cloud-credentials -o yaml apiVersion: v1 data: vcenter.devqe.ibmc.devcluster.openshift.com.password: xxx vcenter.devqe.ibmc.devcluster.openshift.com.username: xxx kind: Secret metadata: annotations: cloudcredential.openshift.io/credentials-request: openshift-cloud-credential-operator/openshift-vmware-vsphere-csi-driver-operator …
$ oc -n kube-system get secret vsphere-creds -o yaml apiVersion: v1 data: vcsa2-qe.vmware.devcluster.openshift.com.password: xxx vcsa2-qe.vmware.devcluster.openshift.com.username: xxx (Updated to vcsa2-qe)
vmware-vsphere-cloud-credentials now contains entries for two vCenters:
$ oc -n openshift-cluster-csi-drivers get secret vmware-vsphere-cloud-credentials -o yaml apiVersion: v1 data: vcenter.devqe.ibmc.devcluster.openshift.com.password: xxx vcenter.devqe.ibmc.devcluster.openshift.com.username: xxx vcsa2-qe.vmware.devcluster.openshift.com.password: xxx vcsa2-qe.vmware.devcluster.openshift.com.username: xxx (devqe and vcsa2-qe)
$ oc -n kube-system get secret vsphere-creds -o yaml apiVersion: v1 data: vcenter.devqe.ibmc.devcluster.openshift.com.password: xxx vcenter.devqe.ibmc.devcluster.openshift.com.username: xxx (Updated to devqe)
vmware-vsphere-cloud-credentials still contains entries for two vCenters:
$ oc -n openshift-cluster-csi-drivers get secret vmware-vsphere-cloud-credentials -o yaml apiVersion: v1 data: vcenter.devqe.ibmc.devcluster.openshift.com.password: xxx vcenter.devqe.ibmc.devcluster.openshift.com.username: xxx vcsa2-qe.vmware.devcluster.openshift.com.password: xxx vcsa2-qe.vmware.devcluster.openshift.com.username: xxx (devqe and vcsa2-qe)
Actual results:
The secret/vmware-vsphere-cloud-credentials is not synced correctly: entries for the previous vCenter are never removed
Expected results:
The secret/vmware-vsphere-cloud-credentials should contain exactly the entries present in secret/vsphere-creds
Additional info:
Storage vSphere csi driver controller pods are crash looping.
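The leftover vcsa2-qe keys shown above are what one would see if the target secret were updated by merging keys rather than by replacing its data. The Python sketch below contrasts the two approaches on plain dictionaries; it is an illustration only, the function names are hypothetical, and it does not claim to reproduce how the operators actually sync the secret.

```python
def sync_merge(source: dict, target: dict) -> dict:
    """Copy source keys onto the target without pruning; stale keys survive."""
    merged = dict(target)
    merged.update(source)
    return merged


def sync_replace(source: dict, target: dict) -> dict:
    """Make the target data identical to the source; stale keys are dropped."""
    return dict(source)


# Target secret after the temporary switch to vcsa2-qe (values elided as in the report).
target = {
    "vcenter.devqe.ibmc.devcluster.openshift.com.username": "xxx",
    "vcenter.devqe.ibmc.devcluster.openshift.com.password": "xxx",
    "vcsa2-qe.vmware.devcluster.openshift.com.username": "xxx",
    "vcsa2-qe.vmware.devcluster.openshift.com.password": "xxx",
}
# Source secret after switching back to devqe.
source = {
    "vcenter.devqe.ibmc.devcluster.openshift.com.username": "xxx",
    "vcenter.devqe.ibmc.devcluster.openshift.com.password": "xxx",
}

print(sorted(sync_merge(source, target)))    # still lists the vcsa2-qe keys
print(sorted(sync_replace(source, target)))  # only the devqe keys remain
```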
Description of problem:
After extensive debugging on HostedControlPlanes in dual stack mode, we have discovered that the issues the QE department sees occur in dual stack environments. In Hypershift/HostedControlPlane, we have an HAProxy in the dataplane (worker nodes of the HostedCluster). This HAProxy is unable to redirect calls to the KubeApiServer in the ControlPlane; it attempts to connect using both protocols, IPv6 first and then IPv4. The issue is that the HostedCluster is exposing services in NodePort mode, and it seems that the master nodes of the management cluster are not opening these NodePorts in IPv6, only in IPv4. Even though the master node shows this listener with netstat:
tcp6 9 0 :::32272 :::* LISTEN 6086/ovnkube
the port appears to be open only for IPv4, as it is not possible to connect to the API via IPv6 even locally. This happens only with dual stack; single-stack IPv4 and single-stack IPv6 both work correctly.
Version-Release number of selected component (if applicable):
4.14.X 4.15.X
How reproducible:
100%
Steps to Reproduce:
1. Deploy an Openshift management cluster in dual stack mode 2. Deploy MCE 2.4 3. Deploy a HostedCluster in dual stack mode
Actual results:
- Many pods stuck in ContainerCreating state - The HostedCluster cannot be deployed, many COs blocked and clusterversion also stuck
Expected results:
The HostedCluster deployment completes
Additional info:
To reproduce the issue you could contact @jparrill or @Liangquan Li in slack, this will make things easier for the environment creation.
Description of problem:
iam:TagInstanceProfile is not listed in the official documentation [1]. An IPI install fails if the iam:TagInstanceProfile permission is missing:
level=error msg=Error: creating IAM Instance Profile (ci-op-4hw2rz1v-49c30-zt9vx-worker-profile): AccessDenied: User: arn:aws:iam::301721915996:user/ci-op-4hw2rz1v-49c30-minimal-perm is not authorized to perform: iam:TagInstanceProfile on resource: arn:aws:iam::301721915996:instance-profile/ci-op-4hw2rz1v-49c30-zt9vx-worker-profile because no identity-based policy allows the iam:TagInstanceProfile action
level=error msg= status code: 403, request id: bb0641f5-d01c-4538-b333-261a804ddb59
[1] https://docs.openshift.com/container-platform/4.14/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. Install a common IPI cluster with the minimal permissions listed in the official documentation.
Actual results:
Install failed.
Expected results:
Additional info:
install does a precheck for iam:TagInstanceProfile
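A permission precheck of this kind can be pictured as a set difference between the actions the installer needs and the actions the credentials allow. The Python sketch below is a simplified illustration: the function name missing_permissions and both lists are hypothetical, and a real precheck would have to obtain the granted actions from AWS rather than from a literal list.

```python
def missing_permissions(required: list, granted: list) -> list:
    """Return the required IAM actions that are absent from the granted ones."""
    return sorted(set(required) - set(granted))


# Hypothetical excerpts for illustration; not the installer's full lists.
required = [
    "iam:CreateInstanceProfile",
    "iam:AddRoleToInstanceProfile",
    "iam:TagInstanceProfile",
]
granted = ["iam:CreateInstanceProfile", "iam:AddRoleToInstanceProfile"]

print(missing_permissions(required, granted))
```

Reporting the missing action before any resource is created avoids the AccessDenied failure partway through the install.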
The cluster-dns-operator repository vendors k8s.io/* v0.27.2 and controller-runtime v0.15.0. OpenShift 4.15 is based on Kubernetes 1.28.
4.15.
Always.
Check https://github.com/openshift/cluster-dns-operator/blob/release-4.15/go.mod.
The k8s.io/* packages are at v0.27.2, and the sigs.k8s.io/controller-runtime package is at v0.15.0.
The k8s.io/* packages are at v0.28.0 or newer, and the sigs.k8s.io/controller-runtime package is at v0.16.0 or newer.
The controller-runtime v0.16 release includes some breaking changes; see the release notes at https://github.com/kubernetes-sigs/controller-runtime/releases/tag/v0.16.0.
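The go.mod check described above can also be scripted. The Python sketch below parses go.mod text and reports modules whose minor version is below a given minimum; the helper names and the embedded go.mod excerpt are hypothetical (the excerpt only reuses the versions quoted in this report), and the parser handles just the simple "module version" lines needed here.

```python
def parse_requirements(go_mod: str) -> dict:
    """Map module path to version for simple 'path vX.Y.Z' lines in go.mod text."""
    versions = {}
    for line in go_mod.splitlines():
        fields = line.split("//")[0].split()  # drop trailing comments
        if fields and fields[0] == "require":
            fields = fields[1:]  # single-line 'require path version' form
        if len(fields) == 2 and fields[1].startswith("v") and fields[1][1:2].isdigit():
            versions[fields[0]] = fields[1]
    return versions


def minor(version: str) -> int:
    """Return the minor component of a version such as 'v0.27.2'."""
    return int(version.lstrip("v").split(".")[1])


def below_minimum(versions: dict, minimum_minor_by_prefix: dict) -> dict:
    """Return modules whose minor version is lower than the minimum for their prefix."""
    stale = {}
    for module, version in versions.items():
        for prefix, min_minor in minimum_minor_by_prefix.items():
            if module.startswith(prefix) and minor(version) < min_minor:
                stale[module] = version
    return stale


# Hypothetical excerpt using the versions quoted in this report.
GO_MOD = """
module github.com/openshift/cluster-dns-operator

require (
    k8s.io/api v0.27.2
    k8s.io/client-go v0.27.2
    sigs.k8s.io/controller-runtime v0.15.0
)
"""

minimums = {"k8s.io/": 28, "sigs.k8s.io/controller-runtime": 16}
print(below_minimum(parse_requirements(GO_MOD), minimums))
```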
Description of problem:
Warning: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-17-145803
How reproducible:
Always
Steps to Reproduce:
1. oc rollout restart ds/ovnkube-node
Actual results:
Warning: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead
Expected results:
No warning
When creating an Agent ISO for OCI, we should add the kernel argument console=ttyS0 to the ISO/PXE kargs.
CoreOS does not include a console argument by default when the platform is metal, because different hardware has different consoles and specifying one can cause booting to fail on some machines. It does include one on many cloud platforms. Since we know when the user is definitely using OCI (there are validations in assisted that ensure it) and we know the correct settings for OCI, we should set them up automatically.
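The rule above amounts to a platform-conditional append. A minimal sketch, assuming a hypothetical platform identifier `"oci"` and a plain list of kernel arguments (neither reflects the actual installer data structures):

```python
OCI_CONSOLE_ARG = "console=ttyS0"


def kernel_args(platform, base_args):
    """Return the kargs list, adding the serial console only for OCI."""
    args = list(base_args)  # copy so the caller's list is left untouched
    if platform == "oci" and OCI_CONSOLE_ARG not in args:
        args.append(OCI_CONSOLE_ARG)
    return args


if __name__ == "__main__":
    print(kernel_args("oci", ["rw"]))
    print(kernel_args("baremetal", ["rw"]))
```

Leaving other platforms untouched preserves the metal default of not naming a console.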
Description of problem:
A GitHub project with a Containerfile instead of a Dockerfile is not seen as a Buildah target, and the wizard falls through to templating as a standard (language) project.
Version-Release number of selected component (if applicable):
Server Version: 4.13.18 Kubernetes Version: v1.26.9+c7606e7
How reproducible:
Always
Steps to Reproduce:
1. Create a git application with Containerfile, e.g. https://github.com/cwilkers/jumble-c
2. Use the Developer view to add the app as a git repo
3. Observe failure as project is not built properly due to ignoring Containerfile
Actual results:
Build failure
Expected results:
Buildah includes Containerfile which includes html and other resources required for app
Additional info:
https://github.com/cwilkers/jumble-c
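The detection gap in this report can be pictured as a lookup that only knows one file name. The sketch below shows the expected behaviour under the assumption that the check inspects the file names at the repository root; `BUILD_FILES` and `detect_build_file` are hypothetical names, not the console's code.

```python
# File names treated as container build definitions, in order of preference.
BUILD_FILES = ("Dockerfile", "Containerfile")


def detect_build_file(repo_files):
    """Return the first recognised build file in `repo_files`, or None."""
    names = set(repo_files)
    for candidate in BUILD_FILES:
        if candidate in names:
            return candidate
    return None


if __name__ == "__main__":
    print(detect_build_file(["Containerfile", "index.html"]))
```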
We need to update packages_ironic.yml to be closer to the current opendev master upper constraints.
After the new packages are created, we'll have to tag them and update the ironic-image configuration.
This is a clone of issue OCPBUGS-25862. The following is the description of the original issue:
—
At 17:26:09, the cluster is happily upgrading nodes:
An update is in progress for 57m58s: Working towards 4.14.1: 734 of 859 done (85% complete), waiting on machine-config
At 17:26:54, the upgrade starts to reboot master nodes and COs get noisy (this one specifically is OCPBUGS-20061)
An update is in progress for 58m50s: Unable to apply 4.14.1: the cluster operator control-plane-machine-set is not available
~Two minutes later, at 17:29:07, CVO starts to shout about waiting on operators for over 40 minutes despite not indicating anything is wrong earlier:
An update is in progress for 1h1m2s: Unable to apply 4.14.1: wait has exceeded 40 minutes for these operators: etcd, kube-apiserver
This is only because these operators go briefly degraded during master reboot (which they shouldn't but that is a different story). CVO computes its 40 minutes against the time when it first started to upgrade the given operator so it:
1. Upgrades etcd / KAS very early in the upgrade, noting the time when it started to do that
2. These two COs upgrade successfully and the upgrade proceeds
3. Eventually the cluster starts rebooting masters and etcd/KAS go degraded
4. CVO compares the current time against the noted time, discovers it's more than 40 minutes and starts warning about it.
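The four steps above come down to which timestamp the 40-minute window is measured from. The sketch below contrasts the two choices on a hypothetical timeline (the dates and the 16:30 start are invented for illustration); the second comparison is one possible alternative and is an assumption, not the actual CVO fix.

```python
from datetime import datetime, timedelta

WAIT_LIMIT = timedelta(minutes=40)


def wait_exceeded(now, reference_time, limit=WAIT_LIMIT):
    """True when more than `limit` has elapsed since `reference_time`."""
    return now - reference_time > limit


# Hypothetical timeline loosely modelled on the report.
first_started = datetime(2023, 1, 1, 16, 30)  # CVO begins updating etcd
went_degraded = datetime(2023, 1, 1, 17, 27)  # masters reboot, etcd degrades
now = datetime(2023, 1, 1, 17, 29)

if __name__ == "__main__":
    # Behaviour described above: the clock starts at the first update attempt.
    print(wait_exceeded(now, first_started))
    # Alternative (assumption): the clock starts when the operator degraded.
    print(wait_exceeded(now, went_degraded))
```

With the first reference point, a two-minute degradation is reported as a wait of more than 40 minutes, which matches the spurious message in the report.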
Version-Release number of selected component (if applicable):
all
How reproducible:
Not entirely deterministic:
1. the upgrade must go for 40m+ between upgrading etcd and upgrading nodes
2. the upgrade must reboot a master that is not running CVO (otherwise there will be a new CVO instance without the saved times, they are only saved in memory)
Steps to Reproduce:
1. Watch oc adm upgrade during the upgrade
Actual results:
Spurious "waiting for over 40m" message pops out of the blue
Expected results:
CVO simply says "waiting up to 40m on" and this eventually goes away as the node goes up and etcd goes out of degraded.
This is a clone of issue OCPBUGS-27222. The following is the description of the original issue:
—
Description of problem:
On ipv6primary dualstack cluster, creating an ipv6 egressIP following this procedure:
is not working. ovnkube-cluster-manager shows below error:
2024-01-16T14:48:18.156140746Z I0116 14:48:18.156053 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:48:18.161367817Z I0116 14:48:18.161269 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:48:18.161416023Z I0116 14:48:18.161357 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:37.714410622Z I0116 14:49:37.714342 1 reflector.go:790] k8s.io/client-go/informers/factory.go:159: Watch close - *v1.Service total 8 items received
2024-01-16T14:49:48.155826915Z I0116 14:49:48.155330 1 obj_retry.go:296] Retry object setup: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.156172766Z I0116 14:49:48.155899 1 obj_retry.go:358] Adding new object: *v1.EgressIP egress-dualstack-ipv6
2024-01-16T14:49:48.168795734Z I0116 14:49:48.168520 1 obj_retry.go:370] Retry add failed for *v1.EgressIP egress-dualstack-ipv6, will try again later: cloud add request failed for CloudPrivateIPConfig: fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
2024-01-16T14:49:48.169400971Z I0116 14:49:48.168937 1 event.go:298] Event(v1.ObjectReference{Kind:"EgressIP", Namespace:"", Name:"egress-dualstack-ipv6", UID:"", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'CloudAssignmentFailed' egress IP: fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333 for object EgressIP: egress-dualstack-ipv6 could not be created, err: CloudPrivateIPConfig.cloud.network.openshift.io "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333" is invalid: [<nil>: Invalid value: "": "metadata.name" must validate at least one schema (anyOf), metadata.name: Invalid value: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333": metadata.name in body must be of type ipv4: "fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333"]
The same is observed with an IPv6 subnet in SLAAC mode.
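The object name in the log, fd2e.6f44.5dd8.c956.f816.3eff.fef0.3333, is the egress IP with its colons replaced by dots. A minimal sketch of that mapping, assuming the name is derived from the fully expanded address (the function name is hypothetical):

```python
import ipaddress


def cloud_private_ip_config_name(ip):
    """Map an IP address to a dot-separated object name.

    IPv4 addresses are returned unchanged; IPv6 addresses are fully
    expanded and their colons are replaced by dots.
    """
    addr = ipaddress.ip_address(ip)
    if addr.version == 4:
        return str(addr)
    return addr.exploded.replace(":", ".")


if __name__ == "__main__":
    print(cloud_private_ip_config_name("fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333"))
    print(cloud_private_ip_config_name("192.168.192.111"))
```

The validation error above shows the schema accepting only names of type ipv4, so a name produced this way for an IPv6 address is rejected.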
Version-Release number of selected component (if applicable):
How reproducible: Always.
Steps to Reproduce:
Applying below:
$ oc label node/ostest-8zrlf-worker-0-4h78l k8s.ovn.org/egress-assignable=""
$ cat egressip_ipv4.yaml && cat egressip_ipv6.yaml
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv4
spec:
  egressIPs:
  - 192.168.192.111
  namespaceSelector:
    matchLabels:
      app: egress
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-dualstack-ipv6
spec:
  egressIPs:
  - fd2e:6f44:5dd8:c956:f816:3eff:fef0:3333
  namespaceSelector:
    matchLabels:
      app: egress
$ oc apply -f egressip_ipv4.yaml
$ oc apply -f egressip_ipv6.yaml
But it shows only info about ipv4 egressIP. The IPv6 port is not even created in openstack:
$ oc logs -n openshift-cloud-network-config-controller cloud-network-config-controller-67cbc4bc84-786jm
I0116 13:15:48.914323 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:48.928927 1 cloudprivateipconfig_controller.go:357] CloudPrivateIPConfig: "192.168.192.111" will be added to node: "ostest-8zrlf-worker-0-4h78l"
I0116 13:15:48.942260 1 cloudprivateipconfig_controller.go:381] Adding finalizer to CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:48.943718 1 controller.go:182] Assigning key: 192.168.192.111 to cloud-private-ip-config workqueue
I0116 13:15:49.758484 1 openstack.go:760] Getting port lock for portID 8854b2e9-3139-49d2-82dd-ee576b0a0cce and IP 192.168.192.111
I0116 13:15:50.547268 1 cloudprivateipconfig_controller.go:439] Added IP address to node: "ostest-8zrlf-worker-0-4h78l" for CloudPrivateIPConfig: "192.168.192.111"
I0116 13:15:50.602277 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
I0116 13:15:50.614413 1 controller.go:160] Dropping key '192.168.192.111' from the cloud-private-ip-config workqueue
$ openstack port list --network network-dualstack | grep -e 192.168.192.111 -e 6f44:5dd8:c956:f816:3eff:fef0:3333
| 30fe8d9a-c1c6-46c3-a873-9a02e1943cb7 | egressip-192.168.192.111 | fa:16:3e:3c:23:2a | ip_address='192.168.192.111', subnet_id='ae8a4c1f-d3e4-4ea2-bc14-ef1f6f5d0bbe' | DOWN |
Actual results: ipv6 egressIP object is ignored.
Expected results: ipv6 egressIP is created and can be attached to a pod.
Additional info: must-gather linked in private comment.
Description of problem:
Test case failure- OpenShift alerting rules [apigroup:image.openshift.io] should have description and summary annotations
The obtained response seems to have unmarshalling errors:
Failed to fetch alerting rules: unable to parse response invalid character 's' after object key
Expected output- The response should be well-formed and the unmarshalling should succeed
OpenShift Version- 4.13 & 4.14
Cloud Provider/Platform- PowerVS
Prow Job Link/Must gather path- https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.13-ocp-e2e-ovn-ppc64le-powervs/1700992665824268288/artifacts/ocp-e2e-ovn-ppc64le-powervs/
Please review the following PR: https://github.com/openshift/openshift-state-metrics/pull/102
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25612. The following is the description of the original issue:
—
Description of problem:
Logs for PipelineRuns fetched from the Tekton Results API are not loading
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the Log tab of a PipelineRun fetched from the Tekton Results
Actual results:
Logs window is empty with a loading indicator
Expected results:
Logs should be shown
Additional info:
Description of problem:
Observation from CIS v1.4 PDF: 1.1.3 Ensure that the controller manager pod specification file
When I checked, I found that the description of the controller manager pod specification file in the CIS v1.4 PDF is as follows:
"Ensure that the controller manager pod specification file has permissions of 600 or more restrictive. OpenShift 4 deploys two API servers: the OpenShift API server and the Kube API server. The OpenShift API server delegates requests for Kubernetes objects to the Kube API server. The OpenShift API server is managed as a deployment. The pod specification yaml for openshift-apiserver is stored in etcd. The Kube API Server is managed as a static pod. The pod specification file for the kube-apiserver is created on the control plane nodes at /etc/kubernetes/manifests/kube-apiserver-pod.yaml. The kube-apiserver is mounted via hostpath to the kube-apiserver pods via /etc/kubernetes/static-pod-resources/kube-apiserver-pod.yaml with permissions 600."
To conform with CIS benchmarks, the controller manager pod specification file should be updated to 600.
$ for i in $( oc get pods -n openshift-kube-controller-manager -o name -l app=kube-controller-manager)
do
oc exec -n openshift-kube-controller-manager $i -- stat -c %a /etc/kubernetes/static-pod-resources/kube-controller-manager-pod.yaml
done
644
644
644
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-20-215234
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
The controller manager pod specification file for the kube-apiserver is 644.
Expected results:
The controller manager pod specification file for the kube-apiserver is 600.
Additional info:
https://github.com/openshift/library-go/commit/19a42d2bae8ba68761cfad72bf764e10d275ad6e
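The "600 or more restrictive" requirement quoted in this report can be checked with a bit mask: a mode passes when it grants nothing beyond owner read and write. A minimal sketch (the function name is hypothetical), taking the octal string printed by `stat -c %a` as input:

```python
def is_600_or_stricter(mode):
    """True if `mode` sets no permission bits outside owner read/write."""
    return (mode & 0o777 & ~0o600) == 0


if __name__ == "__main__":
    # `stat -c %a` prints the mode as an octal string such as "644".
    for text in ("644", "600", "400"):
        print(text, is_600_or_stricter(int(text, 8)))
```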
Please review the following PR: https://github.com/openshift/ironic-image/pull/397
Description of problem:
A node fails to join the cluster as its CSR contains an incorrect hostname
oc describe csr csr-7hftm
Name: csr-7hftm
Labels: <none>
Annotations: <none>
CreationTimestamp: Tue, 24 Oct 2023 10:22:39 -0400
Requesting User: system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
Signer: kubernetes.io/kube-apiserver-client-kubelet
Status: Pending
Subject:
  Common Name: system:node:openshift-worker-1
  Serial Number:
  Organization: system:nodes
Events: <none>
oc get csr csr-7hftm -o yaml
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  creationTimestamp: "2023-10-24T14:22:39Z"
  generateName: csr-
  name: csr-7hftm
  resourceVersion: "96957"
  uid: 84b94213-0c0c-40e4-8f90-d6612fbdab58
spec:
  groups:
  - system:serviceaccounts
  - system:serviceaccounts:openshift-machine-config-operator
  - system:authenticated
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlIN01JR2lBZ0VBTUVBeEZUQVRCZ05WQkFvVERITjVjM1JsYlRwdWIyUmxjekVuTUNVR0ExVUVBeE1lYzNsegpkR1Z0T201dlpHVTZiM0JsYm5Ob2FXWjBMWGR2Y210bGNpMHhNRmt3RXdZSEtvWkl6ajBDQVFZSUtvWkl6ajBECkFRY0RRZ0FFMjRabE1JWGE1RXRKSGgwdWg2b3RVYTc3T091MC9qN0xuSnFqNDJKY0dkU01YeTJVb3pIRTFycmYKOTFPZ3pOSzZ5Z1R0Qm16NkFOdldEQTZ0dUszMlY2QUFNQW9HQ0NxR1NNNDlCQU1DQTBnQU1FVUNJRFhHMlFVWQoxMnVlWXhxSTV3blArRFBQaE5oaXhiemJvaTBpQzhHci9kMXRBaUVBdEFDcVVwRHFLYlFUNWVFZXlLOGJPN0dlCjhqVEI1UHN1SVpZM1pLU1R2WG89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=
  signerName: kubernetes.io/kube-apiserver-client-kubelet
  uid: c3adb2e0-6d60-4f56-a08d-6b01d3d3c065
  usages:
  - digital signature
  - client auth
  username: system:serviceaccount:openshift-machine-config-operator:node-bootstrapper
status: {}
Version-Release number of selected component (if applicable):
4.14.0-rc.6
How reproducible:
So far only on one setup
Steps to Reproduce:
1. Deploy dualstack baremetal cluster with day1 networking with static DHCP hostnames
Actual results:
A node fails to join the cluster
Expected results:
All nodes join the cluster
The argument has been deprecated in the v0.14.0 release:
https://github.com/brancz/kube-rbac-proxy/releases/tag/v0.14.0
This is a clone of issue OCPBUGS-27842. The following is the description of the original issue:
—
Current description of HighOverallControlPlaneCPU is wrong for SNO cases and can mislead users. We need to add information regarding SNO clusters to the description of the alert
For OCP 4.14+, we need to add --enable-defaulting-webhook true to the hypershift install command in CI.
Reference: https://github.com/openshift/hypershift/pull/2922/files
Slack thread: https://redhat-internal.slack.com/archives/C014N2VLTQE/p1694090399430659
The security team will soon start having the code owners also address CWE (Common Weakness Enumeration) findings. Although a CWE is not a CVE per se, it may have security ramifications.
This issue addresses weak MD5 primitive usages in CMO.
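The usual remedy for a weak-MD5 finding is to swap in a stronger digest. The sketch below uses SHA-256 from the Python standard library purely as an illustration; how CMO actually uses its hashes is not stated here, so the function name and its use as a content fingerprint are assumptions.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return the hex SHA-256 digest of `data` (used here in place of MD5)."""
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    print(content_hash(b"example configuration"))
```

Any stored or compared digests change when the algorithm changes, so consumers of the old MD5 values would need to be updated at the same time.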
This is a clone of issue OCPBUGS-25337. The following is the description of the original issue:
—
Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.
Affected Platforms:
Is it an
If it is an internal RedHat testing failure:
If it is a CI failure:
If it is a customer / SD issue:
In https://github.com/openshift/release/blob/master/ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh#L37-L42, the script downloads yq-v4 from GitHub and uses it in the following step.
This is a potential issue when multiple concurrent jobs are running at the same time, because GitHub may deny access.
We have hit such issues before, so we installed yq-3.3.0 in the upi-installer image; refer to https://github.com/openshift/installer/blob/master/images/installer/Dockerfile.upi.ci.rhel8#L46-L50. Is it possible to migrate the code to use yq-3.3.0 from the upi-installer image?
Before we migrate a lot of ci jobs from arm and amd to multiarch ci, we need to resolve such issues.
cc Lin Wang
Description of problem:
openshift-install is unable to generate an aarch64 iso: FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100 %
Steps to Reproduce:
1. Create an install_config.yaml with controlplane.architecture and compute.architecture = arm64
2. openshift-install agent create image --log-level debug
Actual results:
DEBUG Generating Agent Installer ISO...
INFO Consuming Install Config from target directory
DEBUG Purging asset "Install Config" from disk
INFO Consuming Agent Config from target directory
DEBUG Purging asset "Agent Config" from disk
DEBUG initDisk(): start
DEBUG initDisk(): regular file
FATAL failed to write asset (Agent Installer ISO) to disk: missing boot.catalog file
Expected results:
agent.aarch64.iso is created
Additional info:
Seems to be related to this PR: https://github.com/openshift/installer/pull/7896
boot.catalog is also referenced in the assisted-image-service here: https://github.com/openshift/installer/blob/master/vendor/github.com/openshift/assisted-image-service/pkg/isoeditor/isoutil.go#L155
Description of problem:
A recent [PR](https://github.com/openshift/hypershift/commit/c030ab66d897815e16d15c987456deab8d0d6da0) updated the kube-apiserver service port to `6443`. That change causes a small outage when upgrading from a 4.13 cluster in IBMCloud. We need to keep the service port as 2040 for IBM Cloud Provider to avoid the outage.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
In this recent PR that merged, a number of API calls do not use caches causing excessive calls.
Done when:
- Change all Get() calls to use listers
- API call metric should decrease
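The difference between direct Get() calls and lister reads can be pictured with a call counter. This is a deliberately simplified sketch: `CountingAPI` and `CachedLister` are hypothetical stand-ins, and the cache here is a one-time snapshot, whereas real listers are kept current by an informer's watch.

```python
class CountingAPI:
    """Stand-in for an API server that counts every request it receives."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.calls = 0

    def get(self, name):
        self.calls += 1
        return self._objects[name]

    def list(self):
        self.calls += 1
        return dict(self._objects)


class CachedLister:
    """Serves reads from a local snapshot filled by a single list call."""

    def __init__(self, api):
        self._cache = api.list()

    def get(self, name):
        return self._cache[name]


if __name__ == "__main__":
    direct = CountingAPI({"cfg": "v1"})
    for _ in range(10):
        direct.get("cfg")
    print("direct calls:", direct.calls)

    cached = CountingAPI({"cfg": "v1"})
    lister = CachedLister(cached)
    for _ in range(10):
        lister.get("cfg")
    print("cached calls:", cached.calls)
```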
Please review the following PR: https://github.com/openshift/service-ca-operator/pull/221
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/88
Description of problem:
When multiple consecutive spaces are present in Pod logs, the spaces are collapsed and white-space is not retained when reviewing logs via the OpenShift Web Console. The white-space is retained when reviewing via the 'raw' output and via the `oc logs` command but the white-space is collapsed when reviewing via the `logs` panel in the OpenShift Web Console. This mangles the output of tables in the logs.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Every time
Steps to Reproduce:
1. Create a Pod which outputs a table in the logs
2. Review the output table in the Pod logs via the OpenShift Web Console
Actual results:
The spaces in the table are collapsed
Expected results:
The table formatting should be maintained
Additional info:
- During testing, I added the `white-space:pre` styling to the log lines, and this resolved the white-space issues. The logs do not appear to be styled to retain white-space formatting.
- Tested on OCP 4.10.53 and 4.13.4; both have the issue.
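To see why collapsing mangles tables, the default HTML treatment of white space (runs of white space rendered as a single space) can be imitated in a few lines. This sketch only emulates the rendering effect and is not the console's code; the sample rows are invented.

```python
import re


def collapse_whitespace(line):
    """Imitate default HTML rendering: each run of white space becomes one space."""
    return re.sub(r"\s+", " ", line)


ROWS = [
    "NAME      READY   STATUS",
    "db        1/1     Running",
]

if __name__ == "__main__":
    for row in ROWS:
        print(row)                        # columns line up, as with white-space: pre
    for row in ROWS:
        print(collapse_whitespace(row))   # columns no longer line up
```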
Description of problem:
The Google Cloud CLI deprecated Python 3.5-3.7 as of version 448.0.0, causing release CI jobs to fail with:
ERROR: gcloud failed to load. You are running gcloud with Python 3.6, which is no longer supported by gcloud.
The version was specified as 447.0.0.
job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/logs/periodic-ci-o[…]cp-upi-f28-destructive/1719562110486188032
Description of problem:
If the `currentConfig` is removed from the master node, the Machine Config Daemon will not recreate it. The logs will say:
~~~
W0726 23:57:35.890645 3013426 daemon.go:1097] Got an error from auxiliary tools: could not get current config from disk: open /etc/machine-config-daemon/currentconfig: no such file or directory
~~~
However, the MCD won't create that currentconfig. Is this desired state? The workaround is to create the correct annotation
Version-Release number of selected component (if applicable):
OpenShift 4.12 and tested on 4.13
How reproducible:
- remove the currentConfig from the node
- check the status of the MCD
Steps to Reproduce:
1. 2. 3.
Actual results:
- the currentconfig is missing
- stopping the MCD
Expected results:
- if the currentconfig is missing, MCD should reconcile based on the desiredconfig label of the node
Additional info:
Please review the following PR: https://github.com/openshift/prom-label-proxy/pull/357
Description of problem:
Backport of live migration suite in origin to 4.15
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/244
Please review the following PR: https://github.com/openshift/openshift-apiserver/pull/406
Please review the following PR: https://github.com/openshift/cluster-api-provider-baremetal/pull/205
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/24
Please review the following PR: https://github.com/openshift/cloud-provider-ibm/pull/53
If a user changes the providerSpec in CPMS to add a block device for etcd, we need to check that the size is valid, or it can result in unhealthy clusters.
What
Via https://gitlab.cee.redhat.com/service/uhc-account-manager/-/merge_requests/4233, OCM has renamed subscription_labels to ocm_subscription, and some or many recording rules are likely to be affected, for example https://github.com/openshift/telemeter/blob/8f091e8e7ecd3052566bd9dd20eb6991abf762c5/jsonnet/telemeter/rules.libsonnet#L34
How
Update the rules.
Please review the following PR: https://github.com/openshift/azure-file-csi-driver/pull/44
Please review the following PR: https://github.com/openshift/cluster-policy-controller/pull/143
This is a clone of issue OCPBUGS-25890. The following is the description of the original issue:
—
Description of problem:
When the user clicks on the perspective switcher after a hard refresh, a flicker appears
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-25-100326
How reproducible:
Always, after the user refreshes the console
Steps to Reproduce:
1. User logs in to the OCP console
2. Refresh the whole console, then click the perspective switcher
Actual results:
there is flicker when clicking on perspective switcher
Expected results:
no flickers
Additional info:
screen recording https://drive.google.com/file/d/1_2tPZ0DXNTapFP9sSz27vKbnwxxdWZSV/view?usp=drive_link
Please review the following PR: https://github.com/openshift/machine-os-images/pull/30
Description of problem:
The apiserver-url.env file is a dependency of all CCM components. These mostly run on the masters; however, on Azure, they also run on workers. A recent change in kube (https://github.com/kubernetes/kubernetes/pull/121028) fixed a previous bug, and as a result workers no longer bootstrap, since Kubelet no longer sets an IP address. To resolve this issue, we need the CNM to be able to talk to KAS outside of the CNI. This already works on masters, but the url env file is missing on workers, so they get stuck.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The agent container image is currently ~770MB. On slow networks, this can take a long time to download, and users don't know why their host isn't being discovered.
Some suggestions from Omer Tuchfeld:
The following test is permafailing (see below for sippy link)
[sig-storage] [Serial] Volume metrics PVC should create metrics for total time taken in volume operations in P/V Controller [Suite:openshift/conformance/serial] [Suite:k8s]
Example failure
The test doesn't seem to always run in serial jobs, but whenever it does run, it fails. And it's often the only test that fails in the run. This only started a few days ago, around the 4th.
Additional context here:
Description of problem:
A Nutanix machine without enough memory gets stuck in Provisioning, and machineset scale and delete operations stop working
Version-Release number of selected component (if applicable):
Server Version: 4.12.0 4.13.0-0.nightly-2023-01-17-152326
How reproducible:
Always
Steps to Reproduce:
1. Install a Nutanix cluster with the template https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/tree/master/functionality-testing/aos-4_12/ipi-on-nutanix//versioned-installer
   master_num_memory: 32768
   worker_num_memory: 16384
   networkType: "OVNKubernetes"
   installer_payload_image: quay.io/openshift-release-dev/ocp-release:4.12.0-x86_64
2.
3. Scale up the cluster worker machineset from 2 replicas to 40 replicas
4. Install an Infra machineset with 3 replicas and a Workload machineset with 1 replica. Refer to this doc https://docs.openshift.com/container-platform/4.11/machine_management/creating-infrastructure-machinesets.html#machineset-yaml-nutanix_creating-infrastructure-machinesets and configure the following resources:
   VCPU=16
   MEMORYMB=65536
   MEMORYSIZE=64Gi
Actual results:
1. The new infra machines are stuck in 'Provisioning' status for about 3 hours.
% oc get machines -A | grep Prov
openshift-machine-api qili-nut-big-jh468-infra-48mdt Provisioning 175m
openshift-machine-api qili-nut-big-jh468-infra-jnznv Provisioning 175m
openshift-machine-api qili-nut-big-jh468-infra-xp7xb Provisioning 175m
2. Checking the Nutanix web console, I found that infra machine 'qili-nut-big-jh468-infra-jnznv' had the following message:
"No host has enough available memory for VM qili-nut-big-jh468-infra-48mdt (8d7eb6d6-a71e-4943-943a-397596f30db2) that uses 4 vCPUs and 65536MB of memory. You could try downsizing the VM, increasing host memory, power off some VMs, or moving the VM to a different host. Maximum allowable VM size is approximately 17921 MB"
Infra machine 'qili-nut-big-jh468-infra-jnznv' is not found.
Infra machine 'qili-nut-big-jh468-infra-xp7xb' is green, without warnings. But in the must-gather I found this error:
03:23:49 openshift-machine-api nutanixcontroller qili-nut-big-jh468-infra-xp7xb FailedCreate qili-nut-big-jh468-infra-xp7xb: reconciler failed to Create machine: failed to update machine with vm state: qili-nut-big-jh468-infra-xp7xb: failed to get node qili-nut-big-jh468-infra-xp7xb: Node "qili-nut-big-jh468-infra-xp7xb" not found
3. Scaling down the worker machineset from 40 replicas to 30 replicas does not work. There are still 40 Running worker machines and 40 Ready nodes after about 3 hours.
% oc get machinesets -A
NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE
openshift-machine-api qili-nut-big-jh468-infra 3 3 176m
openshift-machine-api qili-nut-big-jh468-worker 30 30 30 30 5h1m
openshift-machine-api qili-nut-big-jh468-workload 1 1 176m
% oc get machines -A | grep worker| grep Running -c
40
% oc get nodes | grep worker | grep Ready -c
40
4. I deleted the infra machineset, but the machines are still in Provisioning status and don't get deleted.
% oc delete machineset -n openshift-machine-api qili-nut-big-jh468-infra
machineset.machine.openshift.io "qili-nut-big-jh468-infra" deleted
% oc get machinesets -A
NAMESPACE NAME DESIRED CURRENT READY AVAILABLE AGE
openshift-machine-api qili-nut-big-jh468-worker 30 30 30 30 5h26m
openshift-machine-api qili-nut-big-jh468-workload 1 1 3h21m
% oc get machines -A | grep -v Running
NAMESPACE NAME PHASE TYPE REGION ZONE AGE
openshift-machine-api qili-nut-big-jh468-infra-48mdt Provisioning 3h22m
openshift-machine-api qili-nut-big-jh468-infra-jnznv Provisioning 3h22m
openshift-machine-api qili-nut-big-jh468-infra-xp7xb Provisioning 3h22m
openshift-machine-api qili-nut-big-jh468-workload-qdkvd 3h22m
Expected results:
The new infra machines should be either Running or Failed. Scaling the cluster worker machineset up and down should not be impacted.
Additional info:
must-gather download url will be added to the comment.
Please review the following PR: https://github.com/openshift/must-gather/pull/381
This is a clone of issue OCPBUGS-27027. The following is the description of the original issue:
—
W0109 17:47:02.340203 1 builder.go:109] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'
Description of problem:
The following "error" shows up when running a GCP destroy: "Invalid instance ci-op-nlm7chi8-8411c-4tl9r-master-0 in target pool af84a3203fc714c64a8043fdc814386f, target pool will not be destroyed". It is a bit misleading, because the alert fires when the resource is simply not part of the cluster.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
As a developer, I want to be able to:
so that I can achieve
Description of criteria:
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/58
Description of problem:
A change to how Power VS Workspaces are queried is not compatible with the version of terraform-provider-ibm
Version-Release number of selected component (if applicable):
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS
2. The deployment fails with the error: [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Actual results:
Fail with [ERROR] Error retrieving service offering: ServiceDoesnotExist: Given service : "power-iaas" doesn't exist
Expected results:
Install should succeed.
Additional info:
This bug has been seen during the analysis of another issue
If the Server Internal IP is not defined, CBO crashes as nil is not handled in https://github.com/openshift/cluster-baremetal-operator/blob/release-4.12/provisioning/utils.go#L99
I0809 17:33:09.683265 1 provisioning_controller.go:540] No Machines with cluster-api-machine-role=master found, set provisioningMacAddresses if the metal3 pod fails to start
I0809 17:33:09.690304 1 clusteroperator.go:217] "new CO status" reason=SyncingResources processMessage="Applying metal3 resources" message=""
I0809 17:33:10.488862 1 recorder_logging.go:37] &Event{ObjectMeta:{dummy.1779c769624884f4 dummy 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[] map[] [] [] []},InvolvedObject:ObjectReference{Kind:Pod,Namespace:dummy,Name:dummy,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,},Reason:ValidatingWebhookConfigurationUpdated,Message:Updated ValidatingWebhookConfiguration.admissionregistration.k8s.io/baremetal-operator-validating-webhook-configuration because it changed,Source:EventSource{Component:,Host:,},FirstTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,LastTimestamp:2023-08-09 17:33:10.488745204 +0000 UTC m=+5.906952556,Count:1,Type:Normal,EventTime:0001-01-01 00:00:00 +0000 UTC,Series:nil,Action:,Related:nil,ReportingController:,ReportingInstance:,}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1768fd4]
goroutine 574 [running]:
github.com/openshift/cluster-baremetal-operator/provisioning.getServerInternalIP({0x1e774d0?, 0xc0001e8fd0?})
	/go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:75 +0x154
github.com/openshift/cluster-baremetal-operator/provisioning.GetIronicIP({0x1ea2378?, 0xc000856840?}, {0x1bc1f91, 0x15}, 0xc0004c4398, {0x1e774d0, 0xc0001e8fd0})
	/go/src/github.com/openshift/cluster-baremetal-operator/provisioning/utils.go:98 +0xfb
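The panic above is a nil dereference that occurs when the server internal IP is not defined. A minimal sketch of the guard pattern follows. The infrastructure and platformStatus types and the serverInternalIP helper are hypothetical stand-ins for illustration; they are not the cluster-baremetal-operator types, and this is not necessarily the fix that was merged.

```go
package main

import (
	"errors"
	"fmt"
)

// platformStatus and infrastructure are hypothetical stand-ins for the
// nested status objects an operator might read; they are not real API types.
type platformStatus struct {
	APIServerInternalIP string
}

type infrastructure struct {
	Platform *platformStatus // nil when the platform status is absent
}

var errNoInternalIP = errors.New("server internal IP is not defined")

// serverInternalIP returns the internal IP, or an error instead of
// panicking when any link in the chain is nil or the value is empty.
func serverInternalIP(infra *infrastructure) (string, error) {
	if infra == nil || infra.Platform == nil || infra.Platform.APIServerInternalIP == "" {
		return "", errNoInternalIP
	}
	return infra.Platform.APIServerInternalIP, nil
}

func main() {
	// Missing platform status: the caller gets an error it can report.
	if _, err := serverInternalIP(&infrastructure{}); err != nil {
		fmt.Println("handled:", err)
	}

	// Populated status: the value is returned unchanged.
	ip, _ := serverInternalIP(&infrastructure{
		Platform: &platformStatus{APIServerInternalIP: "192.0.2.10"},
	})
	fmt.Println("ip:", ip)
}
```

Returning an error lets the caller surface a degraded condition rather than crash-looping the operator pod.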
This is a clone of issue OCPBUGS-25441. The following is the description of the original issue:
—
Description of problem:
"Oh no! Something went wrong" appears in Topology -> Observe tab
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-14-115151
How reproducible:
Always
Steps to Reproduce:
1. Navigate to Topology, click one deployment, and go to the Observe tab
Actual results:
The page crashed
Error
Description:
Component trace:
at te (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:31:9773)
at j (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-plugins-shared~main-chunk-b3bd2b20c770a4e73b50.min.js:12:3324)
at div
at s (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:70124)
at div
at g (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:6:11163)
at div
at d (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:1:174472)
at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:487478)
at t.a (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/dev-console/code-refs/topology-chunk-769d28af48dd4b29136f.min.js:1:486390)
at div
at l (https://console-openshift-console.apps.qe-uidaily-1215.qe.devcluster.openshift.com/static/vendor-patternfly-5~main-chunk-c9c3c11a060d045a85da.min.js:60:106304)
at div
Expected results:
The page does not crash
Additional info:
This is a clone of issue OCPBUGS-29003. The following is the description of the original issue:
—
Description of problem:
After upgrading from 4.13.x to 4.14.10, the workload images that the customer stored in the internal registry are lost, leaving the application pods in the error "Back-off pulling image". Pulling manually with podman also fails, with "manifest unknown", because the image can no longer be found in the registry.
- This behavior was found and reproduced 100% on ARO clusters, where the internal registry is by default backed by the Storage Account created by the ARO RP service principal, which is the Containers blob service.
- I do not know whether the same behavior occurs in non-managed Azure clusters or on any other architecture.
Version-Release number of selected component (if applicable):
4.14.10
How reproducible:
100% with an ARO cluster (Managed cluster)
Steps to Reproduce: Attached.
The workaround found so far is to rebuild the apps or re-import the images. But those tasks are lengthy and costly, especially on a production cluster.
Please review the following PR: https://github.com/openshift/prometheus-alertmanager/pull/75
Please review the following PR: https://github.com/operator-framework/operator-marketplace/pull/539
Please review the following PR: https://github.com/openshift/oauth-proxy/pull/269
Description of problem:
The security-groups.yaml playbook runs the IPv6 security group rule creation tasks regardless of the os_subnet6 value. The when clause does not take the os_subnet6 [1] value into account, so the tasks are always executed.
It works with:
- name: 'Create security groups for IPv6'
  block:
    - name: 'Create master-sg IPv6 rule "OpenShift API"'
    [...]
  when: os_subnet6 is defined
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Always
Steps to Reproduce:
1. Don't set os_subnet6 in the inventory file [2] (so it's not dual-stack)
2. Deploy 4.15 UPI by running the UPI playbooks
Actual results:
IPv6 security group rules are created
Expected results:
IPv6 security group rules shouldn't be created
Additional info:
[1] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/security-groups.yaml#L375
[2] https://github.com/openshift/installer/blob/46fd66272538c350327880e1ed261b70401b406e/upi/openstack/inventory.yaml#L77
Please review the following PR: https://github.com/openshift/images/pull/148
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/966
This is a clone of issue OCPBUGS-25396. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The MCO's gcp-e2e-op-single-node job https://prow.ci.openshift.org/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-single-node has been failing consistently since early January. It always fails on TestKernelArguments, but that happens to be the first test that reboots the node, after which the node never comes up, so we don't get a must-gather and (for some reason) don't get any console gathers either. This affects only 4.16 and only single node. Running the same test on HA GCP clusters yields no issues. The test itself doesn't seem to matter, as the next test would fail the same way if it was skipped. So far this can be reproduced only via a 4.16 clusterbot cluster.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
100%
Steps to Reproduce:
1. Install an SNO 4.16 cluster
2. Run the MCO's TestKernelArguments
Actual results:
Node never comes back up
Expected results:
Test passes
Additional info:
Description of problem:
Once the annotation or labels modals are opened, any changes to the underlying resources will not be reflected in the modal.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Log into a cluster as kubeadmin via the CLI and the console
2. Create a project named test
3. Visit the namespaces list page in the console (Administration > Namespaces)
4. Click "Edit annotations" via the kebab menu for namespace "test"
5. From the CLI, run the command: oc annotate namespace test foo=bar
6. Observe that the annotation modal did not update
7. Click cancel to close the annotation modal
8. Open the annotation modal again and observe that the annotation added from the CLI is now shown
9. Repeat 5 - 8 using the labels modal and the command: oc label namespace test baz=qux
Actual results:
Annotation and labels modals do not update when the underlying resource labels or annotations change.
Expected results:
We should handle this case in some way
Additional info:
We can't necessarily just update the currently displayed data, as this could cause data loss or conflicts. The current behavior can also cause data loss in this situation:
- The user opens the modal
- A background update to the annotations/labels occurs
- The user makes their own change and saves
- The annotations/labels from the background update are lost/squashed
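One possible way to avoid squashing the background update is a three-way merge at save time: apply only the keys the user actually changed on top of the latest server state, and flag keys that both sides changed. The sketch below is an illustration under that assumption, written in Go for brevity. The mergeAnnotations function is hypothetical and is not the console's implementation or a decided fix.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeAnnotations applies the user's edits (the difference between base
// and edited) on top of latest. Keys the user did not touch keep their
// latest value. The second result lists keys changed both by the user and
// in the background to different values; for those, the user's value wins
// in the merged map so the caller can decide how to surface the conflict.
func mergeAnnotations(base, edited, latest map[string]string) (map[string]string, []string) {
	merged := make(map[string]string, len(latest))
	for k, v := range latest {
		merged[k] = v
	}
	var conflicts []string

	// Additions and modifications made by the user.
	for k, v := range edited {
		bv, inBase := base[k]
		if inBase && bv == v {
			continue // untouched by the user
		}
		lv, inLatest := latest[k]
		changedInBackground := inBase != inLatest || bv != lv
		if changedInBackground && !(inLatest && lv == v) {
			conflicts = append(conflicts, k)
		}
		merged[k] = v
	}

	// Deletions made by the user.
	for k, bv := range base {
		if _, kept := edited[k]; kept {
			continue
		}
		if lv, inLatest := latest[k]; inLatest && lv != bv {
			conflicts = append(conflicts, k)
		}
		delete(merged, k)
	}

	sort.Strings(conflicts)
	return merged, conflicts
}

func main() {
	base := map[string]string{}                // state when the modal opened
	edited := map[string]string{"mine": "1"}   // what the user saves
	latest := map[string]string{"foo": "bar"}  // background update (oc annotate)

	merged, conflicts := mergeAnnotations(base, edited, latest)
	fmt.Println(merged, conflicts)
}
```

With the inputs in main, both the background annotation foo=bar and the user's mine=1 survive, and no conflict is reported.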
Description of problem:
Configuring OAuth identity providers in the HostedCluster fails when accessTokenInactivityTimeout is not set
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster
2. Configure htpasswd without the accessTokenInactivityTimeout field in the HostedCluster CR
3. The apply fails
Actual results:
jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters -o yaml > cluster.yaml
spec:
  configuration:
    oauth:
      identityProviders:
      - htpasswd:
          fileData:
            name: htpass-secret
        mappingMethod: claim
        name: my_htpasswd_provider
        type: HTPasswd
    secretRefs:
    - name: htpass-secret
jiezhao-mac:hypershift jiezhao$ oc apply -f cluster.yaml
The HostedCluster "jie-test" is invalid: spec.configuration.oauth: Invalid value: "object": no such key: tokenConfig evaluating rule: spec.configuration.oauth.tokenConfig.accessTokenInactivityTimeout minimum acceptable token timeout value is 300 seconds
Expected results:
htpasswd should be configured successfully without accessTokenInactivityTimeout field
Additional info:
When accessTokenInactivityTimeout is set to 300s, htpasswd is configured in the HostedCluster successfully.
jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters -o yaml > cluster.yaml
spec:
  configuration:
    oauth:
      identityProviders:
      - htpasswd:
          fileData:
            name: htpass-secret
        mappingMethod: claim
        name: my_htpasswd_provider
        type: HTPasswd
      tokenConfig:
        accessTokenInactivityTimeout: 300s
    secretRefs:
    - name: htpass-secret
jiezhao-mac:hypershift jiezhao$ oc apply -f cluster.yaml
hostedcluster.hypershift.openshift.io/jie-test configured
jiezhao-mac:hypershift jiezhao$
jiezhao-mac:hypershift jiezhao$ oc get hostedcluster/jie-test -n clusters -ojsonpath='{.spec.configuration}' | jq
{
  "oauth": {
    "identityProviders": [
      {
        "htpasswd": {
          "fileData": {
            "name": "htpass-secret"
          }
        },
        "mappingMethod": "claim",
        "name": "my_htpasswd_provider",
        "type": "HTPasswd"
      }
    ],
    "tokenConfig": {
      "accessTokenInactivityTimeout": "300s"
    }
  }
}
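The "no such key: tokenConfig" message suggests the validation rule reads tokenConfig.accessTokenInactivityTimeout without first checking that the optional parent exists. The actual rule lives in the HostedCluster CRD validation, not in Go; the sketch below only illustrates the guard-before-dereference logic. The oauthSpec and tokenConfig types and the withTimeoutSeconds helper are simplified, hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// minInactivityTimeout matches the 300-second minimum quoted in the error.
const minInactivityTimeout = 300 * time.Second

// tokenConfig and oauthSpec are hypothetical, simplified stand-ins for the
// optional nested configuration; they are not the real API types.
type tokenConfig struct {
	AccessTokenInactivityTimeout *time.Duration
}

type oauthSpec struct {
	TokenConfig *tokenConfig
}

// validateInactivityTimeout enforces the minimum only when the optional
// field is actually present; an omitted tokenConfig is valid.
func validateInactivityTimeout(o oauthSpec) error {
	if o.TokenConfig == nil || o.TokenConfig.AccessTokenInactivityTimeout == nil {
		return nil // field omitted: nothing to validate
	}
	if *o.TokenConfig.AccessTokenInactivityTimeout < minInactivityTimeout {
		return errors.New("minimum acceptable token timeout value is 300 seconds")
	}
	return nil
}

// withTimeoutSeconds builds a spec with the timeout set (illustration only).
func withTimeoutSeconds(s int64) oauthSpec {
	d := time.Duration(s) * time.Second
	return oauthSpec{TokenConfig: &tokenConfig{AccessTokenInactivityTimeout: &d}}
}

func main() {
	fmt.Println(validateInactivityTimeout(oauthSpec{}))             // omitted: accepted
	fmt.Println(validateInactivityTimeout(withTimeoutSeconds(300))) // at minimum: accepted
	fmt.Println(validateInactivityTimeout(withTimeoutSeconds(60)))  // too small: rejected
}
```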
Description of problem:
The HyperShift Operator does not guarantee that two request serving nodes will be labeled with the HCP's namespace-name. It is likely that it labels the nodes initially and then doesn't notice if the nodes get deleted by something else.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create an HCP with dedicated request serving nodes
2. Delete one of the request serving nodes (either by deleting the node directly or by deleting its machine)
3. Observe that the replacement node does not have the required label for scheduling its request-serving pods
Actual results:
HCPs can exist without two nodes labeled with the HCP's name, causing the kube-apiserver pods to be unschedulable
❯ k get no -lhypershift.openshift.io/cluster=ocm-staging-26ljge23ub1112ve884u0opvkj2c4lpc-perf-rhcp-0012
NAME STATUS ROLES AGE VERSION
ip-10-0-34-188.us-east-2.compute.internal Ready worker 9h v1.27.6+1648878
❯ k get po -n ocm-staging-26ljge23ub1112ve884u0opvkj2c4lpc-perf-rhcp-0012 -lapp=kube-apiserver -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-apiserver-54854bcb7-v88dq 0/5 Pending 0 151m <none> <none> <none> <none>
kube-apiserver-54854bcb7-x5jqt 5/5 Running 0 3h2m 10.128.236.6 ip-10-0-34-188.us-east-2.compute.internal <none> <none>
Expected results:
Every HCP has two nodes labeled with the HCP's name
❯ k get po -n ocm-staging-26ljip0ck3d2i1bejp2sipio4okhgttn-perf-rhcp-0017 -l app=kube-apiserver -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
kube-apiserver-5f85cd4b-l57qr 5/5 Running 0 169m 10.128.218.6 ip-10-0-114-35.us-east-2.compute.internal <none> <none>
kube-apiserver-5f85cd4b-lqfsx 5/5 Running 0 169m 10.128.129.6 ip-10-0-59-232.us-east-2.compute.internal <none> <none>
❯ k get no -lhypershift.openshift.io/cluster=ocm-staging-26ljip0ck3d2i1bejp2sipio4okhgttn-perf-rhcp-0017
NAME STATUS ROLES AGE VERSION
ip-10-0-114-35.us-east-2.compute.internal Ready worker 24h v1.27.6+1648878
ip-10-0-59-232.us-east-2.compute.internal Ready worker 5d2h v1.27.6+1648878
Additional info:
This is a clone of issue OCPBUGS-25125. The following is the description of the original issue:
—
Description of problem:
The `aws-ebs-csi-driver-node-` pods have recently been failing to deploy far too often in CI
Version-Release number of selected component (if applicable):
4.14
How reproducible:
in a statistically significant pattern
Steps to Reproduce:
1. run OCP test suite many times for it to matter
Actual results:
fail [github.com/openshift/origin/test/extended/authorization/scc.go:76]: 1 pods failed before test on SCC errors
Error creating: pods "aws-ebs-csi-driver-node-" is forbidden: unable to validate against any security context constraint: [
provider "anyuid": Forbidden: not usable by user or serviceaccount,
provider restricted-v2: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used,
spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used,
provider restricted-v2: .containers[0].privileged: Invalid value: true: Privileged containers are not allowed,
provider restricted-v2: .containers[0].hostNetwork: Invalid value: true: Host network is not allowed to be used,
provider restricted-v2: .containers[0].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used,
provider restricted-v2: .containers[1].privileged: Invalid value: true: Privileged containers are not allowed,
provider restricted-v2: .containers[1].hostNetwork: Invalid value: true: Host network is not allowed to be used,
provider restricted-v2: .containers[1].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used,
provider restricted-v2: .containers[2].hostNetwork: Invalid value: true: Host network is not allowed to be used,
provider restricted-v2: .containers[2].containers[0].hostPort: Invalid value: 10300: Host ports are not allowed to be used,
provider "restricted": Forbidden: not usable by user or serviceaccount,
provider "nonroot-v2": Forbidden: not usable by user or serviceaccount,
provider "nonroot": Forbidden: not usable by user or serviceaccount,
provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount,
provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount,
provider "hostnetwork-v2": Forbidden: not usable by user or serviceaccount,
provider "hostnetwork": Forbidden: not usable by user or serviceaccount,
provider "hostaccess": Forbidden: not usable by user or serviceaccount,
provider "privileged": Forbidden: not usable by user or serviceaccount
] for DaemonSet.apps/v1/aws-ebs-csi-driver-node -n openshift-cluster-csi-drivers happened 4 times
Expected results:
Test pass
Additional info:
[sig-auth][Feature:SCC][Early] should not have pod creation failures during install [Suite:openshift/conformance/parallel]
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/396
Currently, when attempting installs, our infrastructure "cluster" CR contains a resource with a path such as:
Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources
This causes an issue with CPMSO, resulting in a rollout of new masters after the initial control plane is established:
I0219 07:46:32.076950 1 updates.go:478] "msg"="Machine requires an update" "controller"="controlplanemachineset" "diff"=["Workspace.ResourcePool: /DEVQEdatacenter/host/DEVQEcluster//Resources != /DEVQEdatacenter/host/DEVQEcluster/Resources"] "index"=2 "name"="sgao-devqe-vblw8-master-2" "namespace"="openshift-machine-api" "reconcileID"="5f47f5a5-0a90-4168-bfcc-dae0fad9b953" "updateStrategy"="RollingUpdate"
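The diff in the log line is purely a doubled slash, so the two paths name the same resource pool but fail a plain string comparison. A minimal sketch of comparing such paths after normalization with the standard library's path.Clean is shown below. The sameInventoryPath helper is hypothetical; this is an illustration of the mismatch, not the change made in CPMSO or the installer.

```go
package main

import (
	"fmt"
	"path"
)

// sameInventoryPath reports whether two slash-separated inventory paths
// are equal once redundant slashes are collapsed. Note that path.Clean
// also resolves "." and ".." elements.
func sameInventoryPath(a, b string) bool {
	return path.Clean(a) == path.Clean(b)
}

func main() {
	fromCR := "/DEVQEdatacenter/host/DEVQEcluster//Resources"
	fromMachine := "/DEVQEdatacenter/host/DEVQEcluster/Resources"

	fmt.Println(fromCR == fromMachine)                  // false: raw strings differ
	fmt.Println(sameInventoryPath(fromCR, fromMachine)) // true: same path after cleaning
	fmt.Println(path.Clean(fromCR))                     // /DEVQEdatacenter/host/DEVQEcluster/Resources
}
```

Normalizing where the path is generated, rather than at comparison time, would also keep the stored value free of the doubled slash.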
RHEL 9.3 broke at least ironic when it rebased python-dns to 2.3.0
dnspython 2.3.0 raised AttributeError: module 'dns.rdtypes' has no attribute 'ANY' https://github.com/eventlet/eventlet/issues/781
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/26
MCO installs the resolve-prepender NetworkManager script on the nodes. To find out node details, it needs to pull baremetalRuntimeCfgImage. However, this image needs to be pulled only the first time; on subsequent attempts the script just verifies that the image is available.
This is not desirable in situations where the mirror or quay is unavailable or has a temporary problem. These kinds of issues should not prevent the node from starting kubelet. During certificate rotation testing I noticed that a node with a significant time skew won't start kubelet, because it tries to pull baremetalRuntimeCfgImage before kubelet starts, even though the image is already on the node and doesn't need refreshing.
This is a clone of issue OCPBUGS-26014. The following is the description of the original issue:
—
While testing oc adm upgrade status against b02, I noticed some COs do not have any annotations, while I expected them to have the include/exclude.release.openshift.io/* ones (to recognize COs that come from the payload).
$ b02 get clusteroperator etcd -o jsonpath={.metadata.annotations}
$ ota-stage get clusteroperator etcd -o jsonpath={.metadata.annotations}
{"exclude.release.openshift.io/internal-openshift-hosted":"true","include.release.openshift.io/self-managed-high-availability":"true","include.release.openshift.io/single-node-developer":"true"}
CVO only precreates CO resources and does not reconcile them once they exist. Build02 does not have COs with reconciled metadata because it was born as 4.2, which (AFAIK) is before OCP started to use the exclude/include annotations.
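Version-Release number of selected component (if applicable):
4.16 (development branch)
How reproducible:
deterministic
Steps to Reproduce:
1. Delete an annotation on a ClusterOperator resource
Actual results:
The annotation won't be recreated
Expected results:
The annotation should be recreated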
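Reconciling metadata on an existing object usually amounts to merging the required keys into whatever is live and updating only when something changed. The sketch below illustrates that idea with a hypothetical ensureAnnotations helper. It is not the CVO's implementation, and the annotation key is taken from the output quoted above.

```go
package main

import "fmt"

// ensureAnnotations copies every required key/value into existing and
// reports whether anything changed. Keys that are present in existing
// but not in required are left alone.
func ensureAnnotations(existing, required map[string]string) (map[string]string, bool) {
	changed := false
	if existing == nil {
		existing = map[string]string{}
	}
	for k, v := range required {
		if cur, ok := existing[k]; !ok || cur != v {
			existing[k] = v
			changed = true
		}
	}
	return existing, changed
}

func main() {
	required := map[string]string{
		"include.release.openshift.io/self-managed-high-availability": "true",
	}

	// The annotation was deleted from the live object.
	live := map[string]string{}

	live, changed := ensureAnnotations(live, required)
	fmt.Println(changed, live) // first pass restores the annotation

	_, changed = ensureAnnotations(live, required)
	fmt.Println(changed) // second pass is a no-op, so no update is needed
}
```

The boolean result lets a caller skip the API update when the live object already carries the required annotations.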
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
We are seeing flakes on CNO pod restarts in hypershift CI on the hypershift control plane:
W0905 11:42:53.359515 1 builder.go:106] graceful termination failed, controllers failed with error: failed to get infrastructure name: infrastructureName not set in infrastructure 'cluster'
The current backoff is set to retry.DefaultBackoff, which is appropriate for 409 conflicts but retries for less than 1s in total:
var DefaultBackoff = wait.Backoff{
	Steps:    4,
	Duration: 10 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
Elsewhere in the codebase, retry.DefaultBackoff is used with retry.RetryOnConflict() where it is appropriate, but we need to retry for much longer here and much less frequently.
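To see why this backoff covers well under a second, the sleeps can be summed directly. The sketch below uses a local backoff struct that mirrors the four fields quoted above rather than importing the Kubernetes wait package, and it assumes Steps counts attempts, so there are Steps-1 sleeps between them. The second set of values in main is purely hypothetical, chosen only to show a longer, less frequent schedule.

```go
package main

import (
	"fmt"
	"time"
)

// backoff mirrors the four fields quoted above. It is a local stand-in,
// not the wait.Backoff type itself.
type backoff struct {
	Steps    int
	Duration time.Duration
	Factor   float64
	Jitter   float64
}

// totalSleepMillis sums the sleeps between attempts, ignoring jitter.
// Assumption: Steps counts attempts, so there are Steps-1 sleeps.
// A Jitter of 0.1 would add at most 10% to each sleep.
func totalSleepMillis(b backoff) int64 {
	var total time.Duration
	d := b.Duration
	for i := 0; i < b.Steps-1; i++ {
		total += d
		d = time.Duration(float64(d) * b.Factor)
	}
	return total.Milliseconds()
}

func main() {
	// Values quoted in the issue: sleeps of 10ms, 50ms, and 250ms.
	def := backoff{Steps: 4, Duration: 10 * time.Millisecond, Factor: 5.0, Jitter: 0.1}
	fmt.Println(totalSleepMillis(def)) // 310

	// Hypothetical alternative: sleeps of 1s, 2s, 4s, 8s, and 16s.
	slow := backoff{Steps: 6, Duration: time.Second, Factor: 2.0, Jitter: 0.1}
	fmt.Println(totalSleepMillis(slow)) // 31000
}
```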
This is a clone of issue OCPBUGS-29115. The following is the description of the original issue:
—
Description of problem:
Running without the --node-upgrade-type parameter fails with "spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\"", although --help documents it as having a default value of 'InPlace'
Version-Release number of selected component (if applicable):
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp -v
hcp version openshift/hypershift: af9c0b3ce9c612ec738762a8df893c7598cbf157. Latest supported OCP: 4.15.0
How reproducible:
happens all the time
Steps to Reproduce:
1. On a hosted cluster setup, run:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help
Creates basic functional NodePool resources for Agent platform

Usage:
  hcp create nodepool agent [flags]

Flags:
  -h, --help   help for agent

Global Flags:
      --cluster-name string             The name of the HostedCluster nodes in this pool will join. (default "example")
      --name string                     The name of the NodePool.
      --namespace string                The namespace in which to create the NodePool. (default "clusters")
      --node-count int32                The number of nodes to create in the NodePool. (default 2)
      --node-upgrade-type UpgradeType   The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
      --release-image string            The release image for nodes; if this is empty, defaults to the same release image as the HostedCluster.
      --render                          Render output as YAML to stdout instead of applying.
2. Try to run with the default value of --node-upgrade-type:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
Actual results:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2
2024-02-06T19:57:03+02:00 ERROR Failed to create nodepool {"error": "NodePool.hypershift.openshift.io \"nodepool-of-extra1\" is invalid: spec.management.upgradeType: Unsupported value: \"\": supported values: \"Replace\", \"InPlace\""}
github.com/openshift/hypershift/cmd/nodepool/core.(*CreateNodePoolOptions).CreateRunFunc.func1
	/home/kni/hypershift_working/hypershift/cmd/nodepool/core/create.go:39
github.com/spf13/cobra.(*Command).execute
	/home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:983
github.com/spf13/cobra.(*Command).ExecuteC
	/home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1115
github.com/spf13/cobra.(*Command).Execute
	/home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1039
github.com/spf13/cobra.(*Command).ExecuteContext
	/home/kni/hypershift_working/hypershift/vendor/github.com/spf13/cobra/command.go:1032
main.main
	/home/kni/hypershift_working/hypershift/product-cli/main.go:60
runtime.main
	/home/kni/hypershift_working/go/src/runtime/proc.go:250
Error: NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
NodePool.hypershift.openshift.io "nodepool-of-extra1" is invalid: spec.management.upgradeType: Unsupported value: "": supported values: "Replace", "InPlace"
Expected results:
The command should pass as if you had added the parameter:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type InPlace
NodePool nodepool-of-extra1 created
[kni@ocp-edge119 ~]$
Additional info:
A related issue is that the --help output differs depending on whether it is used with other parameters:
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --cluster-name hosted-0 --name nodepool-of-extra1 --node-count 2 --node-upgrade-type Replace --help > long.help.out
[kni@ocp-edge119 ~]$ ~/hypershift_working/hypershift/bin/hcp create nodepool agent --help > short.help.out
[kni@ocp-edge119 ~]$ diff long.help.out short.help.out
14c14
< --node-upgrade-type UpgradeType The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace (default )
---
> --node-upgrade-type UpgradeType The NodePool upgrade strategy for how nodes should behave when upgraded. Supported options: Replace, InPlace
[kni@ocp-edge119 ~]$
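One way to picture the expected behavior is to resolve an empty flag value to a default before the NodePool is validated. The sketch below is only an illustration: `resolveUpgradeType` is a hypothetical helper, not the hcp implementation, and the InPlace default is an assumption taken from the reporter's reading of the help text.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultUpgradeType is an assumption taken from the report: the help text is
// expected to advertise InPlace as the default for agent NodePools.
const defaultUpgradeType = "InPlace"

var supportedUpgradeTypes = []string{"Replace", "InPlace"}

// resolveUpgradeType is a hypothetical helper, not the hcp implementation.
// It substitutes the default for an empty flag value and rejects anything
// that is not a supported upgrade type.
func resolveUpgradeType(flagValue string) (string, error) {
	if flagValue == "" {
		return defaultUpgradeType, nil
	}
	for _, supported := range supportedUpgradeTypes {
		if flagValue == supported {
			return flagValue, nil
		}
	}
	return "", fmt.Errorf("unsupported value %q: supported values: %s",
		flagValue, strings.Join(supportedUpgradeTypes, ", "))
}

func main() {
	for _, v := range []string{"", "Replace", "Rolling"} {
		got, err := resolveUpgradeType(v)
		fmt.Printf("%q -> %q, err=%v\n", v, got, err)
	}
}
```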
Description of problem:
When executing oc mirror using an OCI path, you can end up in an error state when the destination is a file://<path> destination (i.e. mirror to disk).
Version-Release number of selected component (if applicable):
4.14.2
How reproducible:
always
Steps to Reproduce:
At IBM we use the ibm-pak tool to generate an OCI catalog, but this bug is reproducible using a simple skopeo copy. Once you've copied the image locally, you can move it around using file system copy commands to test this in different ways.
1. Make a directory structure like this to simulate how ibm-pak creates its own catalogs. The problem seems to be related to the path you use, so this represents the failure case:
mkdir -p /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
2. Make a location where the local storage will live:
mkdir -p /root/.ibm-pak/oc-mirror-storage
3. Next, copy the image locally using skopeo:
skopeo copy docker://icr.io/cpopen/ibm-zcon-zosconnect-catalog@sha256:8d28189637b53feb648baa6d7e3dd71935656a41fd8673292163dd750ef91eec oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list --all --format v2s2
4. You can copy the OCI catalog content to a location where things will work properly so you can see a working example:
cp -r /root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list /root/ibm-zcon-zosconnect-catalog
5. You'll need an ISC. I've included both OCI references in the example (the commented-out one works, but the oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list reference fails).
kind: ImageSetConfiguration
apiVersion: mirror.openshift.io/v1alpha2
mirror:
  operators:
  - catalog: oci:///root/.ibm-pak/data/publish/latest/catalog-oci/manifest-list
  #- catalog: oci:///root/ibm-zcon-zosconnect-catalog
    packages:
    - name: ibm-zcon-zosconnect
      channels:
      - name: v1.0
    full: true
    targetTag: 27ba8e
    targetCatalog: ibm-catalog
storageConfig:
  local:
    path: /root/.ibm-pak/oc-mirror-storage
6. Run oc mirror (remember the ISC has OCI refs for the good and bad scenarios). You may want to change your working directory to different locations between running the good and bad examples.
oc mirror --config /root/.ibm-pak/data/publish/latest/image-set-config.yaml "file://zcon" --dest-skip-tls --max-per-registry=6
Actual results:
Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
error: ".ibm-pak/data/publish/latest/catalog-oci/manifest-list/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c" is not a valid image reference: invalid reference format
Expected results:
Simple example where things were working with the oci:///root/ibm-zcon-zosconnect-catalog reference (this was executed in the same workspace, so no new images were detected).
Logging to .oc-mirror.log
Found: zcon/oc-mirror-workspace/src/publish
Found: zcon/oc-mirror-workspace/src/v2
Found: zcon/oc-mirror-workspace/src/charts
Found: zcon/oc-mirror-workspace/src/release-signatures
3 related images processed in 668.063974ms
Writing image mapping to zcon/oc-mirror-workspace/operators.1700092336/manifests-ibm-zcon-zosconnect-catalog/mapping.txt
No new images detected, process stopping
Additional info:
I debugged the error and captured one of the instances where the ParseReference call fails. This is only for reference to help narrow down the issue.
github.com/openshift/oc/pkg/cli/image/imagesource.ParseReference (/root/go/src/openshift/oc-mirror/vendor/github.com/openshift/oc/pkg/cli/image/imagesource/reference.go:111)
github.com/openshift/oc-mirror/pkg/image.ParseReference (/root/go/src/openshift/oc-mirror/pkg/image/image.go:79)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*MirrorOptions).addRelatedImageToMapping (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/fbc_operators.go:194)
github.com/openshift/oc-mirror/pkg/cli/mirror.(*OperatorOptions).plan.func3 (/root/go/src/openshift/oc-mirror/pkg/cli/mirror/operator.go:575)
golang.org/x/sync/errgroup.(*Group).Go.func1 (/root/go/src/openshift/oc-mirror/vendor/golang.org/x/sync/errgroup/errgroup.go:75)
runtime.goexit (/usr/local/go/src/runtime/asm_amd64.s:1594)
Also, because we use a period in the path (i.e. .ibm-pak), I wonder if that's causing the issue. This is just a guess and something to consider.
*FOLLOWUP* ... I just removed the period from ".ibm-pak" and that seemed to make the error go away.
The Ingress Operator should use granular roles in its CredentialsRequest per CCO-249. A change to use granular roles merged after the release-4.15 branch cut. This change needs to be backported for 4.15.0.
4.15.0
Easily.
1. Launch an OCP 4.15 cluster on GCP.
2. Check the ingress operator's CredentialsRequest: oc get -n openshift-cloud-credential-operator credentialsrequests/openshift-ingress-gcp -o yaml
The CredentialsRequest uses a predefined role:
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    predefinedRoles:
    - roles/dns.admin
The CredentialsRequest should specify the individual permissions that the operator requires:
spec:
  providerSpec:
    apiVersion: cloudcredential.openshift.io/v1
    kind: GCPProviderSpec
    permissions:
    - dns.changes.create
    - dns.resourceRecordSets.create
    - dns.resourceRecordSets.update
    - dns.resourceRecordSets.delete
    - dns.resourceRecordSets.list
https://github.com/openshift/cluster-ingress-operator/pull/844 merged in the master branch for 4.16 and needs to be backported to the release-4.15 branch.
Please review the following PR: https://github.com/openshift/cluster-olm-operator/pull/31
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After a control plane release upgrade, when the controlPlaneRelease field is removed from the HostedCluster CR, only capi-provider, cluster-api and control-plane-operator are restarted and run the release image. The other components are not restarted and still run the control plane release image.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create a cluster in 4.14-2023-09-06-180503.
2. Upgrade the control plane release to 4.14-2023-09-07-180503.
3. Remove controlPlaneRelease in the HostedCluster CR.
4. Check all pod/container images in the control plane namespace.
Actual results:
Only capi-provider, cluster-api and control-plane-operator are restarted and run release image 4.14-2023-09-06-180503; other components are not restarted and still run control plane release image 4.14-2023-09-07-180503.
jiezhao-mac:hypershift jiezhao$ oc get hostedcluster -n clusters
NAME       VERSION                         KUBECONFIG                  PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
jie-test   4.14.0-0.ci-2023-09-06-180503   jie-test-admin-kubeconfig   Completed   True        False         The hosted control plane is available
jiezhao-mac:hypershift jiezhao$
- lastTransitionTime: "2023-09-08T01:54:54Z"
  message: '[cluster-api deployment has 1 unavailable replicas, control-plane-operator deployment has 1 unavailable replicas]'
  observedGeneration: 5
  reason: UnavailableReplicas
  status: "True"
  type: Degraded
Expected results:
The control plane should return to release image 4.14-2023-09-06-180503 with all components in a healthy state.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-alibaba/pull/46
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/azure-disk-csi-driver/pull/64
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-27779. The following is the description of the original issue:
—
Description of problem:
When expanding a PVC of unit-less size (e.g., '2147483648'), the Expand PersistentVolumeClaim modal populates the spinner with a unit-less value (e.g., 2147483648) instead of a meaningful value.
Version-Release number of selected component (if applicable):
CNV - 4.14.3
How reproducible:
always
Steps to Reproduce:
1. Create a PVC (and a Pod that mounts it) using the following YAML.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-pv-claim
spec:
  storageClassName: gp3-csi
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: "2147483648"
apiVersion: v1
kind: Pod
metadata:
  name: task-pv-pod
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  volumes:
  - name: task-pv-storage
    persistentVolumeClaim:
      claimName: task-pv-claim
  containers:
  - name: task-pv-container
    image: nginx
    ports:
    - containerPort: 80
      name: "http-server"
    volumeMounts:
    - mountPath: "/usr/share/nginx/html"
      name: task-pv-storage
2. From the newly created PVC details page, click Actions > Expand PVC.
3. Note the value in the spinner input.
See https://drive.google.com/file/d/1toastX8rCBtUzx5M-83c9Xxe5iPA8fNQ/view for a demo
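As an illustration of what a "meaningful value" could look like, the sketch below splits a plain byte count into a number and a binary unit. The helper name `splitBinarySize` and the unit list are assumptions for this example; the console itself is not written in Go and may round or pick units differently.

```go
package main

import "fmt"

var binaryUnits = []string{"B", "KiB", "MiB", "GiB", "TiB", "PiB"}

// splitBinarySize converts a plain byte count into a value and the largest
// binary unit that keeps the value at or above 1.
func splitBinarySize(n int64) (float64, string) {
	value := float64(n)
	unit := 0
	for value >= 1024 && unit < len(binaryUnits)-1 {
		value /= 1024
		unit++
	}
	return value, binaryUnits[unit]
}

func main() {
	// The unit-less request from the PVC above.
	value, unit := splitBinarySize(2147483648)
	fmt.Printf("%g %s\n", value, unit)
}
```

For the PVC in the reproduction steps this yields 2 and GiB, which is what a user would expect to see in the spinner and unit dropdown.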
This is a clone of issue OCPBUGS-28836. The following is the description of the original issue:
—
Description of problem:
Usernames can contain all kinds of characters that are not allowed in resource names. Hash the name instead and use hex representation of the result to get a usable identifier.
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always
Steps to Reproduce:
1. Log in to the web console configured with a login to a 3rd party OIDC provider.
2. Go to the User Preferences page and check the logs in the JavaScript console.
Actual results:
The User Preferences page shows empty values instead of defaults. The JavaScript console reports things like:
```
consoleFetch failed for url /api/kubernetes/api/v1/namespaces/openshift-console-user-settings/configmaps/user-settings-kubeadmin r: configmaps "user-settings-kubeadmin" not found
```
Expected results:
I am able to persist my user preferences.
Additional info:
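The hashing approach described in this report can be sketched as follows. SHA-256 and the `user-settings-` prefix are assumptions made for the example (the prefix mirrors the ConfigMap name in the console error); the report does not say which hash function the console uses.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// userSettingsName derives a ConfigMap name from a username by hashing it.
// SHA-256 and the "user-settings-" prefix are assumptions for this sketch.
// The hex digest contains only [0-9a-f], so the result is a valid resource
// name regardless of which characters the username contains.
func userSettingsName(username string) string {
	sum := sha256.Sum256([]byte(username))
	return "user-settings-" + hex.EncodeToString(sum[:])
}

func main() {
	// A username with characters that are not valid in resource names.
	fmt.Println(userSettingsName("Jane Doe <jane@example.com>"))
}
```

The result is always 78 characters long (a 14-character prefix plus 64 hex digits), well under the 253-character limit for ConfigMap names.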
Please review the following PR: https://github.com/openshift/route-controller-manager/pull/30
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/259
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/53
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/kubernetes-kube-storage-version-migrator/pull/199
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The existing tables that have hard-coded PF5 classnames don't display table headers at mobile resolutions. This is because of the inclusion of `pf-m-grid-md` alongside `pf-v5-c-table`. We should remove `pf-m-grid-md` to restore the behavior from before the PF5 upgrade.
This is a clone of issue OCPBUGS-24421. The following is the description of the original issue:
—
Description of problem:
[vSphere-CSI-Driver-Operator] does not update the VSphereCSIDriverOperatorCRAvailable status in a timely manner
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-04-162702
How reproducible:
Always
Steps to Reproduce:
1. Set up a vSphere cluster with a 4.15 nightly.
2. Back up secret/vmware-vsphere-cloud-credentials to "vmware-cc.yaml".
3. Change the secret/vmware-vsphere-cloud-credentials password to an invalid value under ns/openshift-cluster-csi-drivers using oc edit.
4. Wait for the cluster storage operator to degrade and the driver controller pods to enter CrashLoopBackOff, then restore the backup secret "vmware-cc.yaml" using oc apply.
5. Observe that the driver controller pods return to Running; the cluster storage operator should return to healthy.
Actual results:
In Step 5: the driver controller pods are back to Running, but the cluster storage operator is stuck at Degraded: True status for almost 1 hour.
$ oc get po
NAME                                                    READY   STATUS    RESTARTS        AGE
vmware-vsphere-csi-driver-controller-664db7d497-b98vt   13/13   Running   0               16s
vmware-vsphere-csi-driver-controller-664db7d497-rtj49   13/13   Running   0               23s
vmware-vsphere-csi-driver-node-2krg6                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-node-2t928                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-45kb8                    3/3     Running   2 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-8vhg9                    3/3     Running   1 (3h16m ago)   3h16m
vmware-vsphere-csi-driver-node-9fh9l                    3/3     Running   1 (3h4m ago)    3h5m
vmware-vsphere-csi-driver-operator-5954476ddc-rkpqq     1/1     Running   2 (3h10m ago)   3h17m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-rxdt8      1/1     Running   0               3h16m
vmware-vsphere-csi-driver-webhook-7b6b5d99f6-skcbd      1/1     Running   0               3h16m
$ oc get co/storage -w
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   False       False         True       8m39s   VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: error logging into vcenter: ServerFaultCode: Cannot complete login due to an incorrect user name or password.
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      0s
$ oc get co/storage
NAME      VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
storage   4.15.0-0.nightly-2023-12-04-162702   True        False         False      3m41s
Expected results:
In Step 5: after the driver controller pods are back to Running, the cluster storage operator should recover its healthy status immediately.
Additional info:
Comparing with previous CI results, this issue seems to have started after 4.15.0-0.nightly-2023-11-25-110147.
Description of problem:
The label `hypershift.openshift.io/hosted-control-plane:{hostedcluster resource namespace}-{cluster-name}` is missing from the following HCP resources: "cloud-network-config-controller", "multus-admission-controller", "ovnkube-control-plane".
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a hosted cluster.
2. Check the labels of those resources, e.g. run `$ oc get pod multus-admission-controller-7c677c745c-l4dbc -oyaml` and inspect its labels. Or refer to testcase ocp-44988.
Actual results:
no expected label found
Expected results:
the pods have the label: `hypershift.openshift.io/hosted-control-plane:{hostedcluster resource namespace}-{cluster-name}`
Additional info:
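The check in the reproduction steps can be expressed as a small predicate over a pod's labels. This is only a sketch: the helper names are hypothetical, and the label value format is taken from the `{hostedcluster resource namespace}-{cluster-name}` pattern quoted in this report.

```go
package main

import "fmt"

const hcpLabelKey = "hypershift.openshift.io/hosted-control-plane"

// expectedHCPLabelValue builds "{namespace}-{cluster-name}" as described above.
func expectedHCPLabelValue(hostedClusterNamespace, clusterName string) string {
	return hostedClusterNamespace + "-" + clusterName
}

// hasHCPLabel reports whether a pod's labels carry the expected label.
func hasHCPLabel(podLabels map[string]string, hostedClusterNamespace, clusterName string) bool {
	return podLabels[hcpLabelKey] == expectedHCPLabelValue(hostedClusterNamespace, clusterName)
}

func main() {
	// Hypothetical labels for a pod that is missing the HCP label.
	labels := map[string]string{"app": "multus-admission-controller"}
	fmt.Println(hasHCPLabel(labels, "clusters", "example"))

	// After the label is added, the check passes.
	labels[hcpLabelKey] = "clusters-example"
	fmt.Println(hasHCPLabel(labels, "clusters", "example"))
}
```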
Description of problem:
On a 4.13 cluster installed with
baselineCapabilitySet: None
additionalEnabledCapabilities: ['NodeTuning', 'CSISnapshot']
an upgrade to 4.14 causes the previously disabled Console capability to become ImplicitlyEnabled (in contrast with newly added 4.14 capabilities, which are expected to be enabled implicitly in this case).
'ImplicitlyEnabledCapabilities'
{
  "lastTransitionTime": "2023-10-09T19:08:29Z",
  "message": "The following capabilities could not be disabled: Console, ImageRegistry, MachineAPI",
  "reason": "CapabilitiesImplicitlyEnabled",
  "status": "True",
  "type": "ImplicitlyEnabledCapabilities"
}
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-08-220853
How reproducible:
100%
Steps to Reproduce:
as described above
Additional info:
The root cause appears to be https://github.com/openshift/cluster-kube-apiserver-operator/pull/1542. More info in https://redhat-internal.slack.com/archives/CB48XQ4KZ/p1696940380413289
Description of problem:
- upgrade the cluster
- 2 or more kube-apiserver pods do not come online. Network access could be lost due to a misconfiguration or a wrong RHEL update. We can simulate this by SSHing into a node and running:
iptables -A INPUT -p tcp --destination-port 6443 -j DROP
- 2 or more kube-apiserver-guard pods lose readiness
- kube-apiserver-guard-pdb PDB blocks the node drain because status.currentHealthy is less than status.desiredHealthy
- it is not possible to drain the node without overriding eviction requests (forcefully deleting the guard pods)
Version-Release number of selected component (if applicable):
How reproducible:
100
Steps to Reproduce:
See the description.
Actual results:
evicting pod openshift-kube-apiserver/kube-apiserver-guard-ip-10-0-19-181.eu-north-1.compute.internal
error when evicting pods/"kube-apiserver-guard-ip-10-0-19-181.eu-north-1.compute.internal" -n "openshift-kube-apiserver" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Expected results:
it is possible to evict the unready pods
Additional info:
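The blocking behavior described in this report can be modeled with a simplified version of the PodDisruptionBudget eviction check. The sketch below is an illustration, not Kubernetes source: it contrasts the default handling of unready pods with the upstream `unhealthyPodEvictionPolicy: AlwaysAllow` setting, which is one possible way to reach the expected result and is not confirmed here as the actual fix. The `desiredHealthy` value of 2 is an assumption.

```go
package main

import "fmt"

const (
	ifHealthyBudget = "IfHealthyBudget" // behaves like the default when the field is unset
	alwaysAllow     = "AlwaysAllow"
)

// canEvict is a simplified model of the PodDisruptionBudget eviction check.
func canEvict(podReady bool, currentHealthy, desiredHealthy int, policy string) bool {
	if podReady {
		// Evicting a healthy pod must leave at least desiredHealthy pods.
		return currentHealthy-1 >= desiredHealthy
	}
	if policy == alwaysAllow {
		// Unready pods do not count toward currentHealthy, so evicting
		// them does not reduce availability.
		return true
	}
	// IfHealthyBudget: unready pods may be evicted only while the budget is met.
	return currentHealthy >= desiredHealthy
}

func main() {
	// Scenario from the report: two of three guard pods are unready, so
	// currentHealthy (1) is below desiredHealthy (2, an assumed value).
	fmt.Println(canEvict(false, 1, 2, ifHealthyBudget))
	fmt.Println(canEvict(false, 1, 2, alwaysAllow))
}
```

In the first case the unready guard pod cannot be evicted and the drain stalls; in the second it can, while healthy pods remain protected by the budget.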
Description of problem:
ART is moving the container images to be built by Golang 1.21. We should do the same to keep our build config in sync with ART.
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api-provider-vsphere/pull/17
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/baremetal-operator/pull/302
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Cannot install a SingleNamespace operator using the web console
Version-Release number of selected component (if applicable):
zhaoxia@xzha-mac doc_add_operator % oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-0.nightly-2023-10-08-220853   True        False         168m    Cluster version is 4.14.0-0.nightly-2023-10-08-220853
How reproducible:
always
Steps to Reproduce:
1. Install the CatalogSource:
zhaoxia@xzha-mac doc_add_operator % cat catsrc-singlenamespace.yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: nginx-index
  namespace: openshift-marketplace
spec:
  displayName: Test
  publisher: OLM-QE
  sourceType: grpc
  image: quay.io/olmqe/nginxolm-operator-index:v1-singlenamespace
  updateStrategy:
    registryPoll:
      interval: 10m
oc apply -f catsrc-singlenamespace.yaml
zhaoxia@xzha-mac doc_add_operator % oc get packagemanifests nginx-operator -o yaml
installModes:
- supported: false
  type: OwnNamespace
- supported: true
  type: SingleNamespace
- supported: false
  type: MultiNamespace
- supported: false
  type: AllNamespaces
2. Install nginx-operator using the web console.
Actual results:
nginxolm can't be installed. The web console shows the error message: "nginxolm can't be installed The operator does not support single namespace or global installation modes."
Expected results:
nginxolm can be installed
Additional info:
The error message is confusing: nginx-operator does support SingleNamespace, but the message says "The operator does not support single namespace or global installation modes."
Issue 51 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
The details page Actions dropdown now uses a bolder font that is not used on other action buttons.
Investigation findings: PF5 sets button elements to font-family: inherit, and since this button is inside an <h1> it gets the RedHatDisplay font-family instead of RedHatText. A quick fix would be to add font-family: var(--pf-v5-global--FontFamily--text) to .co-actions
Cluster-scoped resources do not need (or want) metadata.namespace defined. Currently the platform-operators-aggregated ClusterOperator manifest requests a namespace, but that request should be dropped to avoid confusing human and robot readers.
At least 4.15. I haven't dug back to count previous 4.y.
100%
$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.15.0-ec.1-x86_64
$ grep -r5 platform-operators-aggregated manifests/ | grep namespace:
manifests/0000_50_cluster-platform-operator-manager_07-aggregated-clusteroperator.yaml- namespace: openshift-platform-operators
No hits.
We should remove all exceptions added over time to https://github.com/openshift/hypershift/blob/860064d33f4729c2db3c68722d0b5a633e6d1bcd/test/e2e/util/util.go#L414
Multi-vcenter and wrong user/password in secret/vmware-vsphere-cloud-credentials cause the vSphere CSI Driver controller pods to restart
Description of problem:
When there are multiple vCenters in secret/vmware-vsphere-cloud-credentials in ns/openshift-cluster-csi-drivers (see bug https://issues.redhat.com/browse/OCPBUGS-20478), the vSphere CSI Driver controller pods are always restarting.
vmware-vsphere-csi-driver-controller-545dc5679f-mdsjt   0/13    Pending             0   0s
vmware-vsphere-csi-driver-controller-545dc5679f-mdsjt   0/13    ContainerCreating   0   0s
vmware-vsphere-csi-driver-controller-587f78b9c7-br4gs   0/13    Terminating         0   3s
vmware-vsphere-csi-driver-controller-545dc5679f-mdsjt   0/13    Terminating         0   1s
vmware-vsphere-csi-driver-controller-587f78b9c7-9pfmp   0/13    Pending             0   0s
vmware-vsphere-csi-driver-controller-587f78b9c7-9pfmp   0/13    Pending             0   0s
vmware-vsphere-csi-driver-controller-587f78b9c7-9pfmp   0/13    ContainerCreating   0   0s
vmware-vsphere-csi-driver-controller-587f78b9c7-qdb89   12/13   Terminating         0   9s
vmware-vsphere-csi-driver-controller-b946b657-7t74p     13/13   Terminating         0   9s
vmware-vsphere-csi-driver-controller-545dc5679f-mdsjt   0/13    Terminating         0   3s
vmware-vsphere-csi-driver-controller-587f78b9c7-qdb89   0/13    Terminating         0   10s
vmware-vsphere-csi-driver-controller-545dc5679f-75wfm   12/13   Terminating         0   9s
vmware-vsphere-csi-driver-controller-587f78b9c7-9pfmp   0/13    ContainerCreating   0   2s
vmware-vsphere-csi-driver-controller-587f78b9c7-qdb89   0/13    Terminating         0   11s
vmware-vsphere-csi-driver-controller-587f78b9c7-qdb89   0/13    Terminating         0   11s
vmware-vsphere-csi-driver-controller-587f78b9c7-qdb89   0/13    Terminating         0   11s
vmware-vsphere-csi-driver-controller-545dc5679f-75wfm   0/13    Terminating         0   10s
vmware-vsphere-csi-driver-controller-545dc5679f-75wfm   0/13    Terminating         0   11s
vmware-vsphere-csi-driver-controller-545dc5679f-75wfm   0/13    Terminating         0   11s
vmware-vsphere-csi-driver-controller-545dc5679f-75wfm   0/13    Terminating         0   11s
$ oc get co storage
storage 4.14.0-0.nightly-2023-10-10-084534 False True False 15s VSphereCSIDriverOperatorCRAvailable: VMwareVSphereDriverControllerServiceControllerAvailable: Waiting for Deployment
$ oc logs -f deployment.apps/vmware-vsphere-csi-driver-controller --tail=500
{"level":"error","time":"2023-10-12T11:40:38.920487342Z","caller":"service/driver.go:189","msg":"failed to init controller. Error: ServerFaultCode: Cannot complete login due to an incorrect user name or password.","TraceId":"5e60e6c5-efeb-4080-888c-74182e4fb1f4","TraceId":"ec636d3d-1ddb-43a5-b9f7-8541dacff583","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).BeforeServe\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:189\nsigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:202\nmain.main\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
{"level":"info","time":"2023-10-12T11:40:38.920536779Z","caller":"service/driver.go:109","msg":"Configured: \"csi.vsphere.vmware.com\" with clusterFlavor: \"VANILLA\" and mode: \"controller\"","TraceId":"5e60e6c5-efeb-4080-888c-74182e4fb1f4","TraceId":"ec636d3d-1ddb-43a5-b9f7-8541dacff583"}
{"level":"error","time":"2023-10-12T11:40:38.920572294Z","caller":"service/driver.go:203","msg":"failed to run the driver. Err: +ServerFaultCode: Cannot complete login due to an incorrect user name or password.","TraceId":"5e60e6c5-efeb-4080-888c-74182e4fb1f4","stacktrace":"sigs.k8s.io/vsphere-csi-driver/v3/pkg/csi/service.(*vsphereCSIDriver).Run\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/pkg/csi/service/driver.go:203\nmain.main\n\t/go/src/github.com/kubernetes-sigs/vsphere-csi-driver/cmd/vsphere-csi/main.go:71\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:250"}
$ oc logs vmware-vsphere-csi-driver-operator-b4b8d5d56-f76pc I1012 11:43:08.973130 1 event.go:298] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-cluster-csi-drivers", Name:"vmware-vsphere-csi-driver-operator", UID:"a8492b8c-8c13-4b15-aedc-6f3ced80618e", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'DeploymentUpdateFailed' Failed to update Deployment.apps/vmware-vsphere-csi-driver-controller -n openshift-cluster-csi-drivers: Operation cannot be fulfilled on deployments.apps "vmware-vsphere-csi-driver-controller": the object has been modified; please apply your changes to the latest version and try again E1012 11:43:08.996554 1 base_controller.go:268] VMwareVSphereDriverControllerServiceController reconciliation failed: Operation cannot be fulfilled on deployments.apps "vmware-vsphere-csi-driver-controller": the object has been modified; please apply your changes to the latest version and try again W1012 11:43:08.999148 1 driver_starter.go:206] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver W1012 11:43:09.390489 1 driver_starter.go:206] CSI driver can only connect to one vcenter, more than 1 set of credentials found for CSI driver
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-10-084534
How reproducible:
Always
Steps to Reproduce:
See Description
Actual results:
Storage CSI Driver pods are restarting
Expected results:
Storage CSI Driver pods should not be restarting
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/93
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
4.14.0 and 4.15.0
How reproducible:
Every time.
Steps to Reproduce:
1. git clone https://github.com/openshift/installer.git
2. export TAGS=aro
3. hack/build.sh
4. export OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="${RELEASE_IMAGE}"
5. export OPENSHIFT_INSTALL_INVOKER="ARO"
6. Run ccoctl to generate ID resources
7. ./openshift-install create manifests
8. ./openshift-install create cluster --log-level=debug
Actual results:
azure-cloud-provider gets generated with aadClientId = service principal clientID used by the installer.
Expected results:
This step should be skipped and kube-controller-manager should rely on file assets.
Additional info:
Open pull request: https://github.com/openshift/installer/pull/7608
Description of problem:
The installation of OpenShift Container Platform 4.13.4 fails noticeably more often than previous versions when a proxy is configured. The error reported by the MachineConfigPool is shown below.
- lastTransitionTime: "2023-07-04T10:36:44Z"
  message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master2.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found"'
According to https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx this seems to be a known condition, but it is not clear how to prevent it from happening and thereby ensure that installations work as expected.
The major differences found between /etc/mcs-machine-config-content.json on the OpenShift Container Platform 4 control plane node and the rendered-master-${hash} are within the following files:
- /etc/mco/proxy.env
- /etc/kubernetes/kubelet-ca.crt
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.13.4
How reproducible:
Random
Steps to Reproduce:
1. Install OpenShift Container Platform 4.13.4 on AWS with platform:none, proxy defined and both machineCIDR and machineNetwork.cidr set.
Actual results:
Installation is stuck and will eventually fail because the master MachineConfigPool fails to roll out the required MachineConfig:
- lastTransitionTime: "2023-07-04T10:36:44Z"
  message: 'Node master0.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master1.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found", Node master2.example.com is reporting: "machineconfig.machineconfiguration.openshift.io \"rendered-master-1e13d7d4ca10669d3d5a6a2bd532873a\" not found"'
Expected results:
Installation to work or else provide meaningful error messaging
Additional info:
https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec/edit#heading=h.ny6l9ud82fxx was checked, and Red Hat Engineering was then consulted because it was not clear how to proceed.
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/772
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If the installer using cluster api exits before bootstrap destroy, it may leak processes which continue to run in the background of the host system. These processes may continue to reconcile cloud resources, so the cluster resources would be created and recreated even when you are trying to delete them. This occurs because the installer runs kube-apiserver, etcd, and the capi provider binaries as subprocesses. If the installer exits without shutting down those subprocesses, due to an error or user interrupt, the processes will continue to run in the background. The processes can be identified with the ps command. pgrep and pkill are also useful. Brief discussion here of this occurring in PowerVS: https://redhat-internal.slack.com/archives/C05QFJN2BQW/p1712688922574429
Version-Release number of selected component (if applicable):
How reproducible:
Often
Steps to Reproduce:
1. Run capi-based install (on any platform), by specifying fields below in the install config [0]
2. Wait until CAPI controllers begin to run. This will be easy to identify because the terminal will fill with controller logs. Particularly you should see [1]
3. Once the controllers are running interrupt with CTRL + C
[0] Install config for capi install
featureGates:
- ClusterAPIInstall=true
featureSet: CustomNoUpgrade
[1]
INFO Started local control plane with envtest
INFO Stored kubeconfig for envtest in: /c/auth/envtest.kubeconfig
INFO Running process: Cluster API with args [-v=2 --metrics-bind-addr=0 --
Actual results:
Controllers will leak and continue to run. They can be viewed with ps or pgrep. You may also see:
INFO Shutting down local Cluster API control plane...
That means the shutdown started but did not complete.
Expected results:
The installer should shut down gracefully and not leak processes, such as:
^CWARNING Received interrupt signal
INFO Shutting down local Cluster API control plane...
INFO Stopped controller: Cluster API
INFO Stopped controller: aws infrastructure provider
ERROR failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed to create infrastructure manifest: Post "https://127.0.0.1:41441/apis/infrastructure.cluster.x-k8s.io/v1beta2/awsclustercontrolleridentities": unexpected EOF
INFO Local Cluster API system has completed operations
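The expected behaviour above can be illustrated with a minimal Python sketch. It is not the installer's code (the installer is written in Go); `LocalControlPlane` and `run_with_cleanup` are hypothetical names, and the child processes are plain sleepers standing in for kube-apiserver, etcd, and the provider binaries. The point is only that cleanup runs in a `finally` block, so an interrupt or error cannot skip it.

```python
import subprocess
import sys


class LocalControlPlane:
    """Hypothetical stand-in for the set of child processes the installer starts."""

    def __init__(self):
        self.children = []  # list of (name, Popen) in start order

    def start(self, name, argv):
        proc = subprocess.Popen(argv)
        self.children.append((name, proc))
        return proc

    def shutdown(self, timeout=5.0):
        """Stop every child, newest first, escalating to kill on timeout."""
        stopped = []
        for name, proc in reversed(self.children):
            if proc.poll() is None:  # still running
                proc.terminate()
                try:
                    proc.wait(timeout=timeout)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()
            stopped.append(name)
        self.children.clear()
        return stopped


def run_with_cleanup(plane, work):
    """Run work(); always shut the children down, even on Ctrl+C or errors."""
    try:
        return work()
    finally:
        plane.shutdown()
```

Wrapping the whole provisioning step in `run_with_cleanup` means a `KeyboardInterrupt` still propagates to the caller, but only after every child has been reaped.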
Additional info:
This is not going to be pretty. Likely mostly a re-implementation given the way everything was coded to use regexes that depend on the old locator and keys in specific orders. We need a new way to define matchers that uses structured intervals.
We also have some very complex logic around hashing the message to get it into the locator. Possible duplication between watchevents/event.go and duplicated_events.go.
Will be quite delicate and probably very time consuming.
The Ceph storage plugin has moved to its own repository at https://github.com/red-hat-storage/odf-console
The static plugin has not been used for a few releases and now can be removed safely.
Description of problem:
When the TestMTLSWithCRLs e2e test fails on a curl, it checks the stdout, but the stdout could be empty, so it panics:
--- FAIL: TestAll/parallel/TestMTLSWithCRLs (97.09s)
--- FAIL: TestAll/parallel/TestMTLSWithCRLs/certificate-distributes-its-own-crl (97.09s)
panic: runtime error: slice bounds out of range [-3:] [recovered]
panic: runtime error: slice bounds out of range [-3:]
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Experience a failure on the MTLS testing, such as seen in https://redhat-internal.slack.com/archives/CBWMXQJKD/p1688596054069399?thread_ts=1688596036.042119&cid=CBWMXQJKD Search.ci shows two failures in the past two weeks: https://search.ci.openshift.org/?search=FAIL%3A+TestAll%2Fparallel%2FTestMTLSWithCRLs&maxAge=336h&context=1&type=bug%2Bissue%2Bjunit&name=cluster-ingress-operator&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
N/A
Actual results:
Test panics when trying to report an error.
Expected results:
Test reports whatever error it can without panics.
Additional info:
stdout was empty, but https://github.com/openshift/cluster-ingress-operator/blob/4c92a6d1ee80b6b120dd750855a40145a530153c/test/e2e/client_tls_test.go#L1587 doesn't check that the value is empty before it tries to index it.
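The missing guard can be sketched as follows. This is a Python analogue, not the Go test code: Python slicing does not panic on short strings, so the length check is written out explicitly to mirror what the Go fix would need. The function name is hypothetical, and the assumption that the last three characters of the curl output are an HTTP status code is ours, inferred from the `[-3:]` in the panic message.

```python
def http_status_from_curl(stdout: str) -> str:
    """Return the trailing 3-character status code from curl output.

    Raises ValueError with a readable message instead of failing on an
    out-of-range slice when the output is empty or too short.
    """
    tail_len = 3  # assumption: the status code is the last three characters
    trimmed = stdout.strip()
    if len(trimmed) < tail_len:
        raise ValueError(f"curl output too short to contain a status code: {stdout!r}")
    code = trimmed[-tail_len:]
    if not code.isdigit():
        raise ValueError(f"curl output does not end in a status code: {stdout!r}")
    return code
```

With this shape the test can report "curl output too short" alongside the original curl error rather than hiding it behind a panic.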
Description of problem:
make verify uses the latest version of setup-envtest, regardless of what go version the repo is currently on
How reproducible:
100%
Steps to Reproduce:
Running `make verify` without a local image of setup-envtest triggers the issue.
Actual results:
go: sigs.k8s.io/controller-runtime/tools/setup-envtest@latest: sigs.k8s.io/controller-runtime/tools/setup-envtest@v0.0.0-20240323114127-e08b286e313e requires go >= 1.22.0 (running go 1.21.7; GOTOOLCHAIN=local)
Go compliance shim [5685] [rhel-8-golang-1.21][openshift-golang-builder]: Exited with: 1
Expected results:
make verify should be able to run without build errors
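The failure above is a plain version mismatch: `@latest` resolved to a module that needs a newer Go than the repository builds with. A small sketch of that comparison, using only the error line quoted in this report, is shown here. The helper names are hypothetical, and this is not how the Makefile or the Go toolchain performs the check; it only makes the arithmetic explicit.

```python
import re

# Matches the "requires go >= X (running go Y" part of the go command's error.
_REQUIRES = re.compile(r"requires go >= (\d+(?:\.\d+)*) \(running go (\d+(?:\.\d+)*)")


def parse_version(text):
    """Turn '1.21.7' into (1, 21, 7), padding short versions with zeros."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def toolchain_satisfies(running, required):
    return parse_version(running) >= parse_version(required)


def explain(error_line):
    """Extract the required and running Go versions from the error line."""
    match = _REQUIRES.search(error_line)
    if match is None:
        return None
    required, running = match.group(1), match.group(2)
    return {
        "required": required,
        "running": running,
        "ok": toolchain_satisfies(running, required),
    }
```

Pinning setup-envtest to a version whose Go requirement is at or below the repository's Go version avoids the mismatch; which version that is depends on the repository and is not stated in this report.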
Additional info:
Description of problem:
In 4.15.0-0.nightly-2023-10-06-123200, where the Prometheus Operator version is 0.68.0, there is a "duplicate port definition" warning message in the prometheus-operator logs.
$ oc logs deployment/prometheus-operator -n openshift-monitoring | grep "duplicate port definition with" -C2
level=info ts=2023-10-08T01:44:40.586511278Z caller=operator.go:655 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2023-10-08T01:44:40.626492507Z caller=operator.go:655 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=warn ts=2023-10-08T01:44:40.628520232Z caller=klog.go:96 component=k8s_client_runtime func=Warning msg="spec.template.spec.containers[5].ports[0]: duplicate port definition with spec.template.spec.containers[2].ports[0]"
level=info ts=2023-10-08T01:44:40.63072762Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2023-10-08T01:44:40.91709494Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
--
level=info ts=2023-10-08T01:45:19.85277831Z caller=operator.go:655 component=alertmanageroperator key=openshift-monitoring/main msg="sync alertmanager"
level=info ts=2023-10-08T01:45:24.014118091Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=warn ts=2023-10-08T01:45:24.256334754Z caller=klog.go:96 component=k8s_client_runtime func=Warning msg="spec.template.spec.containers[5].ports[0]: duplicate port definition with spec.template.spec.containers[2].ports[0]"
level=info ts=2023-10-08T01:45:24.259230552Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2023-10-08T01:45:24.50510448Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
--
level=info ts=2023-10-08T07:33:33.724893975Z caller=operator.go:1310 component=prometheusoperator key=openshift-monitoring/k8s statefulset=prometheus-k8s shard=0 msg="recreating StatefulSet because the update operation wasn't possible" reason="Forbidden: updates to statefulset spec for fields other than 'replicas', 'ordinals', 'template', 'updateStrategy', 'persistentVolumeClaimRetentionPolicy' and 'minReadySeconds' are forbidden"
level=info ts=2023-10-08T07:33:35.232445429Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=warn ts=2023-10-08T07:33:35.442232343Z caller=klog.go:96 component=k8s_client_runtime func=Warning msg="spec.template.spec.containers[5].ports[0]: duplicate port definition with spec.template.spec.containers[2].ports[0]"
level=info ts=2023-10-08T07:33:35.445827197Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
level=info ts=2023-10-08T07:33:35.708322936Z caller=operator.go:1189 component=prometheusoperator key=openshift-monitoring/k8s msg="sync prometheus"
The kube-rbac-proxy-thanos and thanos-sidecar containers use the same port, 10902. There is no functional effect and the warning may be expected; if so, we could close this bug.
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[5].ports[0]}' | jq
{
  "containerPort": 10902,
  "name": "thanos-proxy",
  "protocol": "TCP"
}
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[2].ports[0]}' | jq
{
  "containerPort": 10902,
  "name": "http",
  "protocol": "TCP"
}
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[5].name}'
kube-rbac-proxy-thanos
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[2].name}'
thanos-sidecar
Checked in 4.14, where the prometheus-operator version is 0.67.1: no such issue.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-10-06-234925 True False 3h33m Cluster version is 4.14.0-0.nightly-2023-10-06-234925
$ oc logs deployment/prometheus-operator -n openshift-monitoring | grep "duplicate port definition with" -C2
no result
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[5].ports[0]}' | jq
{
  "containerPort": 10902,
  "name": "thanos-proxy",
  "protocol": "TCP"
}
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[2].ports[0]}' | jq
{
  "containerPort": 10902,
  "name": "http",
  "protocol": "TCP"
}
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[5].name}'
kube-rbac-proxy-thanos
$ oc -n openshift-monitoring get sts prometheus-k8s -ojsonpath='{.spec.template.spec.containers[2].name}'
thanos-sidecar
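The condition that the warning describes can be reproduced offline with a short sketch. It assumes a simplified rule, namely that two containers in one pod template declaring the same `containerPort` and `protocol` count as duplicates; the exact rule the API server applies is not stated in this report. The function name is hypothetical, and the sample data is taken from the jsonpath output above.

```python
from collections import defaultdict


def duplicate_container_ports(containers):
    """Group declared ports by (containerPort, protocol) and keep collisions.

    `containers` is the list found at .spec.template.spec.containers.
    """
    seen = defaultdict(list)
    for container in containers:
        for port in container.get("ports", []):
            key = (port["containerPort"], port.get("protocol", "TCP"))
            seen[key].append((container["name"], port.get("name")))
    return {key: users for key, users in seen.items() if len(users) > 1}


# The two containers from the report; the other containers are omitted.
containers = [
    {"name": "thanos-sidecar",
     "ports": [{"containerPort": 10902, "name": "http", "protocol": "TCP"}]},
    {"name": "kube-rbac-proxy-thanos",
     "ports": [{"containerPort": 10902, "name": "thanos-proxy", "protocol": "TCP"}]},
]
print(duplicate_container_ports(containers))
```

Because the same two port definitions are present in both 4.14 and 4.15, the sketch reports the collision for either release; only the newer operator surfaces it as a log warning.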
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.15.0-0.nightly-2023-10-06-123200 True False 7h1m Cluster version is 4.15.0-0.nightly-2023-10-06-123200
How reproducible:
always in 4.15
Steps to Reproduce:
1. check prometheus-operator logs
Actual results:
"duplicate port definition" warning message in 4.15 prometheus-operator
Expected results:
Additional info:
we could close this bug, since it seems it's expected
This is a clone of issue OCPBUGS-10851. The following is the description of the original issue:
—
Currently, the plugin template gives you instructions for running the console using a container image, which is a lightweight way to do development and avoids the need to build the console source code from scratch. The image we reference uses a production version of React, however. This means that you aren't able to use the React browser plugin to debug your application.
We should look at alternatives that allow you to use React Developer Tools. Perhaps we can publish a different image that uses a development build. Or at least we need to better document building console locally instead of using an image to allow development builds.
Please review the following PR: https://github.com/openshift/operator-framework-rukpak/pull/34
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
There is a problem with IBM ROKS (managed service) running 4.14+: cluster-storage-operator never sets the upgradeable=True condition, so it shows up as Unknown:
- lastTransitionTime: "2023-11-08T19:07:01Z"
  reason: NoData
  status: Unknown
  type: Upgradeable
This is a regression from 4.13. In 4.13, pkg/operator/snapshotcrd/controller.go was the one that set `upgradeable: True`:
upgradeable := operatorapi.OperatorCondition{
    Type: conditionsPrefix + operatorapi.OperatorStatusTypeUpgradeable,
    Status: operatorapi.ConditionTrue,
}
In the 4.13 bundle from IBM ROKS, these two conditions are set in cluster-scoped-resources/operator.openshift.io/storages/cluster.yaml:
- lastTransitionTime: "2023-11-08T14:22:21Z"
  status: "True"
  type: SnapshotCRDControllerUpgradeable
- lastTransitionTime: "2023-11-08T14:22:21Z"
  reason: AsExpected
  status: "False"
  type: SnapshotCRDControllerDegraded
So the SnapshotCRDController is running and sets `upgradeable: True` on 4.13. But in the 4.14 bundle, SnapshotCRDController no longer exists.
https://github.com/openshift/cluster-storage-operator/pull/385/commits/fa9af3aad65b9d0e9c618453825e4defeaad59ac
So in 4.14+ it's pkg/operator/defaultstorageclass/controller.go that should set the condition:
https://github.com/openshift/cluster-storage-operator/blob/dbb1514dbf9923c56a4a198374cc59e45f9bc0cc/pkg/operator/defaultstorageclass/controller.go#L97-L100
But that only happens if `syncErr == unsupportedPlatformError`, and not if `syncErr == supportedByCSIError`, as is the case with the IBM VPC driver.
- lastTransitionTime: "2023-11-08T14:22:23Z"
  message: 'DefaultStorageClassControllerAvailable: StorageClass provided by supplied CSI Driver instead of the cluster-storage-operator'
  reason: AsExpected
  status: "True"
  type: Available
So what controller will set `upgradeable: True` for IBM VPC?
IBM VPC uses this StatusFilter function for ROKS:
https://github.com/openshift/cluster-storage-operator/blob/dbb1514dbf9923c56a4a198374cc59e45f9bc0cc/pkg/operator/csidriveroperator/csioperatorclient/ibm-vpc-block.go#L17-L27
ROKS and AzureStack are the only deployments using a StatusFilter function, so shouldRunController returns false here because the platform is ROKS:
https://github.com/openshift/cluster-storage-operator/blob/dbb1514dbf9923c56a4a198374cc59e45f9bc0cc/pkg/operator/csidriveroperator/driver_starter.go#L347-L349
Which means there is no controller to set `upgradeable: True`.
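The reasoning above reduces to a small decision table, modelled here in Python. This is a deliberately simplified model, not the operator's Go code: the two string constants only borrow the names of the Go error values quoted in the report, `upgradeable_status` is a hypothetical helper, the successful-sync path is not modelled, and the "fix" set is one possible remedy that the report does not confirm.

```python
# Sentinels named after the Go error values quoted in the report.
UNSUPPORTED_PLATFORM = "unsupportedPlatformError"
SUPPORTED_BY_CSI = "supportedByCSIError"


def upgradeable_status(sync_err, handled):
    """Return the Upgradeable status that ends up on the ClusterOperator.

    `handled` is the set of sync errors for which the controller still
    writes Upgradeable=True. For any other error no controller writes the
    condition, which surfaces as Unknown (reason: NoData).
    """
    return "True" if sync_err in handled else "Unknown"


# Behaviour described in the report for 4.14.
REPORTED_4_14 = {UNSUPPORTED_PLATFORM}

# Hypothetical remedy: treat the CSI-provided case the same way.
HYPOTHETICAL_FIX = REPORTED_4_14 | {SUPPORTED_BY_CSI}

print(upgradeable_status(SUPPORTED_BY_CSI, REPORTED_4_14))     # IBM VPC on ROKS today
print(upgradeable_status(SUPPORTED_BY_CSI, HYPOTHETICAL_FIX))  # with the assumed change
```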
Version-Release number of selected component (if applicable):
4.14.0+
How reproducible:
Always
Steps to Reproduce:
1. Install 4.14 via IBM ROKS
2. Check status conditions in cluster-scoped-resources/config.openshift.io/clusteroperators/storage.yaml
Actual results:
upgradeable=Unknown
Expected results:
upgradeable=True
Additional info:
4.13 IBM ROKS must-gather: https://github.com/Joseph-Goergen/ibm-roks-toolkit/releases/download/test/must-gather-4.13.tar.gz
4.14 IBM ROKS must-gather: https://github.com/Joseph-Goergen/ibm-roks-toolkit/releases/download/test/must-gather.tar.gz
The ServiceMonitors and other related resources were moved in https://issues.redhat.com/browse/MON-669
We thought that moving the RBAC resources made sense as well: https://github.com/openshift/cluster-monitoring-operator/pull/2039#discussion_r1262307325
This is a clone of issue OCPBUGS-28856. The following is the description of the original issue:
—
Description of problem:
When the modal dialogs are used in a hook as part of an actions hook (i.e. useApplicationsActionsProvider), the console throws an error because the console framework passes null objects as part of the render cycle. According to Jon Jackson, the console should be safe from null objects, but it looks like the code for useDeleteModal and getGroupVersionKindForResource is not safe.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Use one of the modal APIs in an actions provider hook
Actual results:
Caught error in a child component: TypeError: Cannot read properties of undefined (reading 'split')
at i (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
at u (main-chunk-9fbeef79a…d3a097ed.min.js:1:1)
at useApplicationActionsProvider (useApplicationActionsProvider.tsx:23:43)
at ApplicationNavPage (ApplicationDetails.tsx:38:67)
at na (vendors~main-chunk-8…87b.min.js:174297:1)
at Hs (vendors~main-chunk-8…87b.min.js:174297:1)
at Sc (vendors~main-chunk-8…87b.min.js:174297:1)
at Cc (vendors~main-chunk-8…87b.min.js:174297:1)
at _c (vendors~main-chunk-8…87b.min.js:174297:1)
at pc (vendors~main-chunk-8…87b.min.js:174297:1)
Expected results:
Works with no error
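The kind of guard that would avoid the `reading 'split'` error can be sketched as follows. This is a Python analogue written for illustration, not the console SDK's TypeScript; `group_version_kind` is a hypothetical name, and the assumption that the failing `split` operates on the resource's `apiVersion` string is ours.

```python
def group_version_kind(resource):
    """Derive group, version, and kind from a resource, tolerating missing input.

    Returns None when the resource, its apiVersion, or its kind is absent,
    instead of failing while splitting a missing string.
    """
    if not resource:
        return None
    api_version = resource.get("apiVersion")
    kind = resource.get("kind")
    if not api_version or not kind:
        return None
    # "apps/v1" -> group "apps", version "v1"; core "v1" -> empty group.
    group, _, version = api_version.rpartition("/")
    return {"group": group, "version": version, "kind": kind}
```

A hook built on such a helper can be called on every render, including the early renders where the framework has not yet supplied a resource.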
Additional info:
Description of the problem:
A base domain that contains a double hyphen (`--`), such as cat--rahul.com, is accepted by the UI and the backend (BE), but once a node is discovered, network validation fails.
This domain is one specific case of using `--`, but note that the UI and BE allow any number of consecutive `-` characters in a domain name.
from agent logs:
Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for ntp-synchronizer ntp-synchronizer-70565cf4 args <[{\"ntp_source\":\"\"}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Creating execution step for domain-resolution domain-resolution-f3917dea args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:123" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating domain resolution with args [{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating inventory with args [fea3d7b9-a990-48a6-9a46-4417915072b0]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Failed to validate domain resolution: data, 
{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}" file="action.go:42" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating ntp synchronizer with args [{\"ntp_source\":\"\"}]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Validating free addresses with args [[\"192.168.123.0/24\"]]" file="action.go:29" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c cp /etc/mtab /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0 && podman run --privileged --pid=host --net=host --rm --quiet -v /var/log:/var/log -v /run/udev:/run/udev -v /dev/disk:/dev/disk -v /run/systemd/journal/socket:/run/systemd/journal/socket -v /var/log:/host/var/log:ro -v /proc/meminfo:/host/proc/meminfo:ro -v /sys/kernel/mm/hugepages:/host/sys/kernel/mm/hugepages:ro -v /proc/cpuinfo:/host/proc/cpuinfo:ro -v /root/mtab-fea3d7b9-a990-48a6-9a46-4417915072b0:/host/etc/mtab:ro -v /sys/block:/host/sys/block:ro -v /sys/devices:/host/sys/devices:ro -v /sys/bus:/host/sys/bus:ro -v /sys/class:/host/sys/class:ro -v /run/udev:/host/run/udev:ro -v /dev/disk:/host/dev/disk:ro registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 inventory]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=error msg="Unable to 
create runner for step <domain-resolution-f3917dea>, args <[{\"domains\":[{\"domain_name\":\"api.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"api-int.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"console-openshift-console.apps.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com.\"},{\"domain_name\":\"validateNoWildcardDNS.dummy---dummy.cat--rahul.com\"},{\"domain_name\":\"quay.io\"}]}]>" file="step_processor.go:126" error="validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match '^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'" request_id=5467e025-2683-4119-a55a-976bb7787279 Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- findmnt --raw --noheadings --output SOURCE,TARGET --target /run/media/iso]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- sh -c podman ps --format '{{.Names}}' | grep -q '^free_addresses_scanner$' || podman run --privileged --net=host --rm --quiet --name free_addresses_scanner -v /var/log:/var/log -v /run/systemd/journal/socket:/run/systemd/journal/socket registry-proxy.engineering.redhat.com/rh-osbs/openshift4-assisted-installer-agent-rhel8:v1.0.0-279 free_addresses '[\"192.168.123.0/24\"]']" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=info msg="Executing nsenter [--target 1 --cgroup --mount --ipc --net -- timeout 30 chronyc -n sources]" file="execute.go:39" Aug 28 11:28:55 master-0-0 next_step_runne[1918]: time="28-08-2023 11:28:55" level=warning msg="Sending step <domain-resolution-f3917dea> reply output <> error <validation failure list:\nvalidation failure list:\ndomains.0.domain_name in body should match 
'^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$'> exit-code <-1>" file="step_processor.go:76" request_id=5467e025-2683-4119-a55a-976bb7787279
How reproducible:
Create a cluster with the domain cat--rahul.com, using the UI fix that allows it.
Once a node is discovered, network validation fails on:
Steps to reproduce:
see above
Actual results:
Unable to install cluster due to network validation failure
Expected results:
The domain should be allowed in regex
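The rejection can be reproduced directly against the pattern quoted in the agent log. In the sketch below, `CURRENT` is that pattern verbatim; `RELAXED` is a hypothetical variant (our assumption, not the fix that was shipped) that allows runs of hyphens inside a label while still rejecting labels that start or end with a hyphen.

```python
import re

# Pattern copied from the validation error in the agent log.
CURRENT = re.compile(r"^([a-zA-Z0-9]+(-[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$")

# Hypothetical relaxation: "-+" accepts one or more hyphens between
# alphanumeric runs, so "cat--rahul" and "dummy---dummy" become valid labels.
RELAXED = re.compile(r"^([a-zA-Z0-9]+(-+[a-zA-Z0-9]+)*[.])+[a-zA-Z]{2,}[.]?$")


def accepted(pattern, domain):
    return pattern.match(domain) is not None


for name in ("quay.io",
             "api.dummy---dummy.cat--rahul.com",
             "validateNoWildcardDNS.dummy---dummy.cat--rahul.com."):
    print(name, accepted(CURRENT, name), accepted(RELAXED, name))
```

Whichever pattern is chosen, the UI, the backend, and the agent need to agree on it, since the failure here comes from the agent being stricter than the other two.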
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When you deploy a HostedCluster and define a KAS AdvertiseAddress, the address can overlap with one of the other networks of the deployment (Service, Cluster, or Machine network), causing a deployment failure.
Version-Release number of selected component (if applicable):
latest
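A pre-flight check for the overlap described above could look like the following sketch. The function name, the network labels, and every address and CIDR in the example are hypothetical values chosen for illustration; none of them come from this report or from the HyperShift code.

```python
import ipaddress


def conflicting_networks(advertise_address, networks):
    """Return the names of the networks that contain the advertise address.

    `networks` maps a label (for example "service") to a CIDR string.
    """
    addr = ipaddress.ip_address(advertise_address)
    conflicts = []
    for name, cidr in networks.items():
        net = ipaddress.ip_network(cidr)
        # Only compare like with like: an IPv4 address cannot sit in an IPv6 network.
        if addr.version == net.version and addr in net:
            conflicts.append(name)
    return conflicts


# Hypothetical example values.
networks = {
    "service": "172.31.0.0/16",
    "cluster": "10.132.0.0/14",
    "machine": "192.168.1.0/24",
}
print(conflicting_networks("172.20.0.1", networks))  # no overlap
print(conflicting_networks("172.31.0.5", networks))  # overlaps the service network
```

Rejecting the HostedCluster at admission time when this list is non-empty would turn a late deployment failure into an immediate, explainable error.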
Description of problem:
Agent-based install on vSphere with multiple workers fails
Version-Release number of selected component (if applicable):
4.13.4
How reproducible:
Always
Steps to Reproduce:
1. Create agent-config, install-config for 3 master, 3+ worker cluster
2. Create Agent ISO image
3. Boot targets from Agent ISO
Actual results:
Deployment hangs waiting on cluster operators
Expected results:
Deployment completes
Additional info:
Multiple pods cannot start due to tainted nodes: "4 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}"
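Why ordinary pods stay Pending on such nodes can be shown with a simplified model of toleration matching. The sketch follows the matching rules documented for Kubernetes taints and tolerations (key, operator, value, effect) but is not the scheduler's implementation. The taint's `NoSchedule` effect is an assumption, since the message above shows only the key and value, and the function names are hypothetical.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches the given taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    if key == "":
        # An empty key is only meaningful with Exists, and then matches any key.
        if operator != "Exists":
            return False
    elif key != taint["key"]:
        return False
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def pod_tolerates(tolerations, taint):
    return any(tolerates(t, taint) for t in tolerations)


# Taint from the report; the effect is assumed.
UNINITIALIZED = {
    "key": "node.cloudprovider.kubernetes.io/uninitialized",
    "value": "true",
    "effect": "NoSchedule",
}
```

A pod without a matching toleration cannot be scheduled until something removes the taint, which is normally the job of the cloud controller manager once it has initialized the node.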
Azure techpreview has been permafailing for about a week:
There's a pod stuck in image pull backoff
NAME                                                       READY   STATUS             RESTARTS        AGE
azureserviceoperator-controller-manager-6b8fc86684-qgrvc   0/2     ImagePullBackOff   0               6h54m
capi-controller-manager-6f96987c5c-zmkpc                   1/1     Running            0               6h54m
capi-operator-controller-manager-578b9bd48f-gkgzv          2/2     Running            1 (6h55m ago)   7h2m
capz-controller-manager-5c6cb77b99-sh98n                   1/1     Running            0               6h54m
cluster-capi-operator-5974b7684b-4qjwn                     1/1     Running            0               7h2m
containerStatuses:
- image: registry.ci.openshift.org/openshift:kube-rbac-proxy
  imageID: ""
  lastState: {}
  name: kube-rbac-proxy
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      message: Back-off pulling image "registry.ci.openshift.org/openshift:kube-rbac-proxy"
      reason: ImagePullBackOff
- image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8cc3384be7d81e745ce671c668465ceef75f65652354ce305d7bee3ae21a5976
  imageID: ""
  lastState: {}
  name: manager
  ready: false
  restartCount: 0
  started: false
  state:
    waiting:
      message: secret "aso-controller-settings" not found
      reason: CreateContainerConfigError
All jobs that run seem to hit the same quota problem we saw recently:
failed to grant creds: error syncing creds in mint-mode: error creating custom role: rpc error: code = ResourceExhausted desc = Maximum number of roles reached. Maximum is: 300\nerror details: retry in 24h0m1s
This time it seems to be surfacing on a new credentials request from storage: openshift-gcp-pd-csi-driver-operator which was just moved from predefined roles to fine grained permissions in https://github.com/openshift/cluster-storage-operator/pull/410, likely why we're now tripping over this limit.
We're going to revert and buy time for CCO team to investigate.
The agent-tui interface for editing the network config for the Agent ISO at boot time only runs on the graphical console (tty1). It's difficult to run two copies, so this gives the most value for now.
Although tty1 always exists, OCI only has a serial console available (assuming it is enabled - see OCPBUGS-19092), so the user doesn't see anything on the console while agent-tui is running (and in fact the systemd progress output is suspended for the duration).
Network configuration of any kind is rarely needed in the cloud, anyway. So on OCI specifically we mostly are slowing boot down by 20s for no real reason. We should disable agent-tui in this case - either by disabling the service or simply not adding the binary to the ISO image.
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/235
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
ImageRegistry became a new optional component in 4.14 (docs#64469, api#1572). And even before that, it has long been configurable for managementState: Removed. However the no-capabilities test is currently failing like:
message: Back-off pulling image "image-registry.openshift-image-registry.svc:5000/openshift/tools:latest"
in clusters without a local registry. We should teach the origin suite to be more forgiving of a lack of internal registry.
4.14 and 4.15. But possibly 4.14 is now stable enough about 4.14 no-capabilities jobs to backport any fixes.
100%
1. Open a recent 4.15 no-cap run and see if it passed.
Lots of test-cases fail to pull from image-registry.openshift-image-registry.svc:5000, which isn't expected to exist for these clusters, where the ImageRegistry capability is not requested.
Passing CI test-cases.
I'm fuzzy on the relationship between ImageStreams and the local image registry, but at the moment, the tools ImageStreams and such are still part of no-caps runs:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-no-capabilities/1709539450616287232/artifacts/e2e-aws-ovn-no-capabilities/gather-must-gather/artifacts/must-gather.tar | tar xOz 31e3c46d361008f321d02ef278f62b1fc4e5510a9902c8ac16de5b2078fed849/namespaces/openshift/image.openshift.io/imagestreams.yaml | yaml2json | jq -r '.items[] | select(.metadata.name == "tools").status'
{
  "dockerImageRepository": "",
  "tags": [
    {
      "items": [
        {
          "created": "2023-10-04T12:24:42Z",
          "dockerImageReference": "registry.ci.openshift.org/ocp/4.15-2023-10-04-015153@sha256:a83089cbb8a8f4ef868e5f37de5d305c10056e4e9761ad37b7c1ab98f465a553",
          "generation": 2,
          "image": "sha256:a83089cbb8a8f4ef868e5f37de5d305c10056e4e9761ad37b7c1ab98f465a553"
        }
      ],
      "tag": "latest"
    }
  ]
}
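One way a test suite could be "more forgiving" is to fall back to the external image reference recorded in the ImageStream status when no internal registry repository is published. The following is a minimal Python sketch of that idea, not the origin suite's actual code; `resolve_pull_spec` is a hypothetical helper, and it assumes the first entry under a tag's `items` is the most recent one.

```python
from typing import Optional


def resolve_pull_spec(status: dict, tag: str = "latest") -> Optional[str]:
    """Pick an image reference from an ImageStream status mapping.

    Prefers the internal registry repository when one is published;
    otherwise falls back to the tag's recorded dockerImageReference.
    """
    repo = status.get("dockerImageRepository")
    if repo:
        # An internal registry is available, so use the usual repo:tag form.
        return f"{repo}:{tag}"
    for entry in status.get("tags", []):
        if entry.get("tag") == tag and entry.get("items"):
            # Assumption: the first item is the most recent import.
            return entry["items"][0].get("dockerImageReference")
    return None
```

With the `tools` status shown above, where `dockerImageRepository` is empty, this returns the `registry.ci.openshift.org/ocp/...@sha256:...` reference rather than a name under `image-registry.openshift-image-registry.svc:5000`.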
Description of problem:
During a highly escalated case, we found the following scenario:
- Due to an unrelated problem, 2 control plane nodes had the "localhost.localdomain" hostname when their respective sdn-controller pods started (that problem is out of the scope of this bug report).
- As both sdn-controller pods had (and retained) the "localhost.localdomain" hostname, both of them used "localhost.localdomain" while trying to acquire and renew the controller lease in the openshift-network-controller configmap.
- This ultimately caused both sdn-controller pods to mistakenly believe that they were the active sdn-controller, so both of them were active at the same time.
Such a situation might have a number of undesired (and unknown) side effects. In our case, the result was that two nodes were allocated the same hostsubnet, disrupting pod communication between the 2 nodes and with the other nodes.
What we expect from this bug report: that the sdn-controller never tries to acquire a lease as "localhost.localdomain" during a failure scenario. The ideal solution would be to acquire the lease in a way that avoids collisions (more on this in the comments), but at the very least, sdn-controller should prefer crash-looping rather than starting with a lease that can collide and wreak havoc.
Version-Release number of selected component (if applicable):
Found on 4.11, but it should be reproducible in 4.13 as well.
How reproducible:
Under some error scenarios where 2 control plane nodes temporarily have "localhost.localdomain" hostname by mistake.
Steps to Reproduce:
1. Start sdn-controller pods 2. 3.
Actual results:
2 sdn-controller pods acquire the lease with "localhost.localdomain" holderIdentity and become active at the same time.
Expected results:
No sdn-controller pod acquires the lease with the "localhost.localdomain" holderIdentity. Either unique identities are used even in a failure scenario, or the pod just crash-loops.
Additional info:
Just FYI, the trigger that caused the wrong domain was investigated at this other bug: https://issues.redhat.com/browse/OCPBUGS-11997 However, this situation may happen under other possible failure scenarios, so it is worth preventing it somehow.
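The expected behaviour (never hold the lease under a placeholder hostname; either use a unique identity or fail fast so the pod crash-loops) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the sdn-controller's actual code: `lease_identity`, the `PLACEHOLDER_HOSTNAMES` set, and the hostname-plus-UUID format are all hypothetical.

```python
import uuid

# Hostnames that suggest the node has not received its real name yet.
# The exact set is an assumption made for this sketch.
PLACEHOLDER_HOSTNAMES = {"localhost", "localhost.localdomain"}


def lease_identity(hostname: str, unique: bool = True) -> str:
    """Return a holderIdentity for leader election, or raise if unsafe.

    Raising lets the process exit (and crash-loop) instead of competing
    for the lease under a name that another node may also be using.
    """
    name = hostname.strip().lower()
    if not name or name in PLACEHOLDER_HOSTNAMES:
        raise ValueError(f"refusing to use {hostname!r} as lease identity")
    if unique:
        # A random suffix keeps identities distinct even if two nodes
        # ever report the same hostname.
        return f"{name}_{uuid.uuid4()}"
    return name
```

The random suffix covers the collision case in general, while the placeholder check covers the specific "localhost.localdomain" failure described in this report.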
This is a clone of issue OCPBUGS-25662. The following is the description of the original issue:
—
Description of problem:
In ROSA/OCP 4.14.z, attaching AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case, this was attached to the ManagedOpenShift-Worker-Role, which is assigned by the installer to all the worker nodes), has no effect on ECR Image pull. User gets an authentication error. Attaching the policy ideally should avoid the need to provide an image-pull-secret. However, the error is resolved only if the user also provides an image-pull-secret. This is proven to work correctly in 4.12.z. Seems something has changed in the recent OCP versions.
Version-Release number of selected component (if applicable):
4.14.2 (ROSA)
How reproducible:
The issue is reproducible using the below steps.
Steps to Reproduce:
1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository
2. The image pulling will fail with Error: ErrImagePull & authentication required errors
Actual results:
The image pull fails with "Error: ErrImagePull" & "authentication required" errors. However, the image pull is successful only if the user provides an image-pull-secret to the deployment.
Expected results:
The image should be pulled successfully by virtue of the ECR-read-only policy attached to the worker node role; without needing an image-pull-secret.
Additional info:
In other words:
In OCP 4.13 (and below), if a user adds the ECR:* permissions to the worker instance profile, then the user can specify ECR images, and authentication of the worker node to ECR is done using the instance profile. In 4.14 this no longer works.
Providing a pull secret in the deployment is not a sufficient alternative, because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers who, up to OCP 4.13, did not have to rotate pull secrets constantly.
The experience in 4.14 should be the same as in 4.13 with ECR.
The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:BatchGetImage",
        "ecr:GetLifecyclePolicy",
        "ecr:GetLifecyclePolicyPreview",
        "ecr:ListTagsForResource",
        "ecr:DescribeImageScanFindings"
      ],
      "Resource": "*"
    }
  ]
}
Description of problem:
Change the UI to a non-en_US locale.
Navigate to Home - Projects - Default - Workloads - Add Page.
Click on 'Upload JAR file'.
"Browse" and "Clear" are in English.
Please see the reference screenshot.
Version-Release number of selected component (if applicable):
4.14.0-rc.2
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Content is in English
Expected results:
Content should be localized
Additional info:
Reference screenshot https://drive.google.com/file/d/1hgP_Rnkn4J4_gVC-T8pUUvAEiAWbfrJq/view?usp=drive_link
Description of problem:
On a baremetal 4.14.0-rc.0 IPv6 SNO cluster, after logging in to the admin console as an admin user, there is no Observe menu in the left navigation bar (see picture: https://drive.google.com/file/d/13RAXPxtKhAElN9xf8bAmLJa0GI8pP0fH/view?usp=sharing). The monitoring-plugin status is Failed (see: https://drive.google.com/file/d/1YsSaGdLT4bMn-6E-WyFWbOpwvDY4t6na/view?usp=sharing). The error is:
Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/ r: Bad Gateway
The console logs show "9443: connect: connection refused":
$ oc -n openshift-console logs console-6869f8f4f4-56mbj
...
E0915 12:50:15.498589 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:15 http: panic serving [fd01:0:0:1::2]:39156: runtime error: invalid memory address or nil pointer dereference
goroutine 183760 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0003b5760, 0x2?, {0xc0009bc7d1, 0x11}, {0x3a41fa0, 0xc0002f6c40}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xaa00000000000010?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0002f6c40?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc0001f7500?, {0x3a41fa0?, 0xc0002f6c40?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7500)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0x5120938?, {0x3a41fa0?, 0xc0002f6c40?}, 0x7ffb6ea27f18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc001102c00?, {0x3a41fa0?, 0xc0002f6c40?}, 0xc000655a00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0002f6c40}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0002f6c40?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc0008201e0?}, {0x3a41fa0, 0xc0002f6c40}, 0xc0001f7400)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc0009b4120, {0x3a43e70, 0xc001223500})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
I0915 12:50:24.267777 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
I0915 12:50:24.267813 1 handlers.go:118] User settings ConfigMap "user-settings-4b4c2f4d-159c-4358-bba3-3d87f113cd9b" already exist, will return existing data.
E0915 12:50:30.155515 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::f735]:9443: connect: connection refused
2023/09/15 12:50:30 http: panic serving [fd01:0:0:1::2]:42990: runtime error: invalid memory address or nil pointer dereference
Connections to port 9443 are refused:
$ oc -n openshift-monitoring get pod -o wide
NAME                                                     READY   STATUS    RESTARTS   AGE     IP                  NODE    NOMINATED NODE   READINESS GATES
alertmanager-main-0                                      6/6     Running   6          3d22h   fd01:0:0:1::564     sno-2   <none>           <none>
cluster-monitoring-operator-6cb777d488-nnpmx             1/1     Running   4          7d16h   fd01:0:0:1::12      sno-2   <none>           <none>
kube-state-metrics-dc5f769bc-p97m7                       3/3     Running   12         7d16h   fd01:0:0:1::3b      sno-2   <none>           <none>
monitoring-plugin-85bfb98485-d4g5x                       1/1     Running   4          7d16h   fd01:0:0:1::55      sno-2   <none>           <none>
node-exporter-ndnnj                                      2/2     Running   8          7d16h   2620:52:0:165::41   sno-2   <none>           <none>
openshift-state-metrics-78df59b4d5-j6r5s                 3/3     Running   12         7d16h   fd01:0:0:1::3a      sno-2   <none>           <none>
prometheus-adapter-6f86f7d8f5-ttflf                      1/1     Running   0          4h23m   fd01:0:0:1::b10c    sno-2   <none>           <none>
prometheus-k8s-0                                         6/6     Running   6          3d22h   fd01:0:0:1::566     sno-2   <none>           <none>
prometheus-operator-7c94855989-csts2                     2/2     Running   8          7d16h   fd01:0:0:1::39      sno-2   <none>           <none>
prometheus-operator-admission-webhook-7bb64b88cd-bvq8m   1/1     Running   4          7d16h   fd01:0:0:1::37      sno-2   <none>           <none>
thanos-querier-5bbb764599-vlztq                          6/6     Running   6          3d22h   fd01:0:0:1::56a     sno-2   <none>           <none>
$ oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::f735   <none>        9443/TCP   7d16h
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
* Trying fd02::f735...
* TCP_NODELAY set
* connect to fd02::f735 port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7
The issue does not occur on another 4.14.0-rc.0 IPv4 cluster, but it reproduces on another 4.14.0-rc.0 IPv6 cluster.
4.14.0-rc.0 IPv4 cluster:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         20m     Cluster version is 4.14.0-rc.0
$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-85bfb98485-nh428   1/1   Running   0   4m   10.128.0.107   ci-ln-pby4bj2-72292-l5q8v-master-0   <none>   <none>
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
...
{
  "name": "monitoring-plugin",
  "version": "1.0.0",
  "displayName": "OpenShift console monitoring plugin",
  "description": "This plugin adds the monitoring UI to the OpenShift web console",
  "dependencies": {
    "@console/pluginAPI": "*"
  },
  "extensions": [
    {
      "type": "console.page/route",
      "properties": {
        "exact": true,
        "path": "/monitoring",
        "component": {
          "$codeRef": "MonitoringUI"
        }
      }
    },
    ...
The "9443: Connection refused" issue occurs on a 4.14.0-rc.0 IPv6 cluster (launched with cluster-bot: launch 4.14.0-rc.0 metal,ipv6) after logging in to the console:
$ oc get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.0-rc.0   True        False         44m     Cluster version is 4.14.0-rc.0
$ oc -n openshift-monitoring get pod -o wide | grep monitoring-plugin
monitoring-plugin-bd6ffdb5d-b5csk   1/1   Running   0   53m   fd01:0:0:4::b   worker-0.ostest.test.metalkube.org   <none>   <none>
monitoring-plugin-bd6ffdb5d-vhtpf   1/1   Running   0   53m   fd01:0:0:5::9   worker-2.ostest.test.metalkube.org   <none>   <none>
$ oc -n openshift-monitoring get svc monitoring-plugin
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
monitoring-plugin   ClusterIP   fd02::402d   <none>        9443/TCP   59m
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -v 'https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json' | jq
* Trying fd02::402d...
* TCP_NODELAY set
* connect to fd02::402d port 9443 failed: Connection refused
* Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
* Closing connection 0
curl: (7) Failed to connect to monitoring-plugin.openshift-monitoring.svc.cluster.local port 9443: Connection refused
command terminated with exit code 7
$ oc -n openshift-console get pod | grep console
console-5cffbc7964-7ljft   1/1   Running   0   56m
console-5cffbc7964-d864q   1/1   Running   0   56m
$ oc -n openshift-console logs console-5cffbc7964-7ljft
...
E0916 14:34:16.330117 1 handlers.go:164] GET request for "monitoring-plugin" plugin failed: Get "https://monitoring-plugin.openshift-monitoring.svc.cluster.local:9443/plugin-manifest.json": dial tcp [fd02::402d]:9443: connect: connection refused
2023/09/16 14:34:16 http: panic serving [fd01:0:0:4::2]:37680: runtime error: invalid memory address or nil pointer dereference
goroutine 3985 [running]:
net/http.(*conn).serve.func1()
    /usr/lib/golang/src/net/http/server.go:1854 +0xbf
panic({0x3259140, 0x4fcc150})
    /usr/lib/golang/src/runtime/panic.go:890 +0x263
github.com/openshift/console/pkg/plugins.(*PluginsHandler).proxyPluginRequest(0xc0008f6780, 0x2?, {0xc000665211, 0x11}, {0x3a41fa0, 0xc0009221c0}, 0xb?)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:165 +0x582
github.com/openshift/console/pkg/plugins.(*PluginsHandler).HandlePluginAssets(0xfe00000000000010?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/plugins/handlers.go:147 +0x26d
github.com/openshift/console/pkg/server.(*Server).HTTPHandler.func23({0x3a41fa0?, 0xc0009221c0?}, 0x7?)
    /go/src/github.com/openshift/console/pkg/server/server.go:604 +0x33
net/http.HandlerFunc.ServeHTTP(...)
    /usr/lib/golang/src/net/http/server.go:2122
github.com/openshift/console/pkg/server.authMiddleware.func1(0xc000d8d600?, {0x3a41fa0?, 0xc0009221c0?}, 0xd?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:25 +0x31
github.com/openshift/console/pkg/server.authMiddlewareWithUser.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d600)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:81 +0x46c
net/http.HandlerFunc.ServeHTTP(0xc000653830?, {0x3a41fa0?, 0xc0009221c0?}, 0x7f824506bf18?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.StripPrefix.func1({0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2165 +0x332
net/http.HandlerFunc.ServeHTTP(0xc00007e800?, {0x3a41fa0?, 0xc0009221c0?}, 0xc000b2da00?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x34025e0?, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2500 +0x149
github.com/openshift/console/pkg/server.securityHeadersMiddleware.func1({0x3a41fa0, 0xc0009221c0}, 0x3305040?)
    /go/src/github.com/openshift/console/pkg/server/middleware.go:128 +0x3af
net/http.HandlerFunc.ServeHTTP(0x0?, {0x3a41fa0?, 0xc0009221c0?}, 0x11db52e?)
    /usr/lib/golang/src/net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0xc000db9b00?}, {0x3a41fa0, 0xc0009221c0}, 0xc000d8d500)
    /usr/lib/golang/src/net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc000653680, {0x3a43e70, 0xc000676f30})
    /usr/lib/golang/src/net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
    /usr/lib/golang/src/net/http/server.go:3089 +0x5ed
Version-Release number of selected component (if applicable):
baremetal 4.14.0-rc.0 ipv6 sno cluster,
$ token=`oc create token prometheus-k8s -n openshift-monitoring`
$ oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?' --data-urlencode 'query=virt_platform' | jq
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "virt_platform",
          "baseboard_manufacturer": "Dell Inc.",
          "baseboard_product_name": "01J4WF",
          "bios_vendor": "Dell Inc.",
          "bios_version": "1.10.2",
          "container": "kube-rbac-proxy",
          "endpoint": "https",
          "instance": "sno-2",
          "job": "node-exporter",
          "namespace": "openshift-monitoring",
          "pod": "node-exporter-ndnnj",
          "prometheus": "openshift-monitoring/k8s",
          "service": "node-exporter",
          "system_manufacturer": "Dell Inc.",
          "system_product_name": "PowerEdge R750",
          "system_version": "Not Specified",
          "type": "none"
        },
        "value": [
          1694785092.664,
          "1"
        ]
      }
    ]
  }
}
How reproducible:
ipv6 cluster
Steps to Reproduce:
1. see the description 2. 3.
Actual results:
There is no Observe menu on the admin console, and the monitoring-plugin status is Failed.
Expected results:
no error
Description of problem:
Setting the label key "a" for platform.gcp.userLabels produces an error message that doesn't explain exactly what is wrong.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-15-164249
How reproducible:
Always
Steps to Reproduce:
1. "create install-config"
2. edit the install-config.yaml to insert userLabels settings (see [1])
3. "create cluster"
Actual results:
Error message shows up telling the label key "a" is invalid.
Expected results:
There should be no error, according to the statement "A label key can have a maximum of 63 characters and cannot be empty. Label must begin with a lowercase letter, and must contain only lowercase letters, numeric characters, and the following special characters `_-`".
Additional info:
$ openshift-install version
openshift-install 4.14.0-0.nightly-2023-10-15-164249
built from commit 359866f9f6d8c86e566b0aea7506dad22f59d860
release image registry.ci.openshift.org/ocp/release@sha256:3c5976a39479e11395334f1705dbd3b56580cd1dcbd514a34d9c796b0a0d9f8e
release architecture amd64
$ openshift-install explain installconfig.platform.gcp.userLabels
KIND:     InstallConfig
VERSION:  v1
RESOURCE: <[]object>
  userLabels has additional keys and values that the installer will add as labels to all resources that it creates on GCP. Resources created by the cluster itself may not include these labels. This is a TechPreview feature and requires setting CustomNoUpgrade featureSet with GCPLabelsTags featureGate enabled or TechPreviewNoUpgrade featureSet to configure labels.
FIELDS:
    key <string> -required-
      key is the key part of the label. A label key can have a maximum of 63 characters and cannot be empty. Label must begin with a lowercase letter, and must contain only lowercase letters, numeric characters, and the following special characters `_-`.
    value <string> -required-
      value is the value part of the label. A label value can have a maximum of 63 characters and cannot be empty. Value must contain only lowercase letters, numeric characters, and the following special characters `_-`.
$
[1]
$ yq-3.3.0 r test12/install-config.yaml platform
gcp:
  projectID: openshift-qe
  region: us-central1
  userLabels:
  - key: createdby
    value: installer-qe
  - key: a
    value: hello
$ yq-3.3.0 r test12/install-config.yaml featureSet
TechPreviewNoUpgrade
$ yq-3.3.0 r test12/install-config.yaml credentialsMode
Passthrough
$ openshift-install create cluster --dir test12
ERROR failed to fetch Metadata: failed to load asset "Install Config": failed to create install config: invalid "install-config.yaml" file: platform.gcp.userLabels[a]: Invalid value: "hello": label key is invalid or contains invalid characters. Label key can have a maximum of 63 characters and cannot be empty. Label key must begin with a lowercase letter, and must contain only lowercase letters, numeric characters, and the following special characters `_-`
$
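The documented rule for label keys can be written down as a single pattern, which makes it easy to see that a one-character key such as "a" satisfies it. The following Python sketch encodes only the rule as quoted in the `explain` output; it is not the installer's validation code, and `LABEL_KEY_RE` and `is_valid_label_key` are names invented for the illustration.

```python
import re

# Pattern derived from the documented rule: the first character is a
# lowercase letter, followed by up to 62 more characters drawn from
# lowercase letters, digits, underscore, and hyphen (63 in total).
LABEL_KEY_RE = re.compile(r"[a-z][a-z0-9_-]{0,62}")


def is_valid_label_key(key: str) -> bool:
    """Return True if key satisfies the documented userLabels key rule."""
    return LABEL_KEY_RE.fullmatch(key) is not None
```

Under this reading, both `createdby` and `a` from the install-config in [1] are valid keys, which matches the expected result of this report.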
Many jobs are failing because route53 is throttling us during cluster creation.
We need to make external-dns make fewer calls.
The theoretical minimum is:
list zones - 1 call
list zone records - (# of records / 100) calls
create 3 records per HC - 1-3 calls depending on how they are batched
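The theoretical minimum listed above can be expressed as a small calculation. This Python sketch assumes that "(# of records / 100)" means one call per page of 100 records, rounded up; `min_route53_calls` is a hypothetical helper, not part of external-dns.

```python
import math


def min_route53_calls(num_records: int, page_size: int = 100) -> tuple[int, int]:
    """Return (best_case, worst_case) API calls for one hosted cluster.

    One call lists the zones, one call per page lists the zone records,
    and creating the 3 records takes 1 call if batched or 3 if not.
    """
    list_zones = 1
    list_records = math.ceil(num_records / page_size)
    return (list_zones + list_records + 1, list_zones + list_records + 3)
```

For example, a zone holding 250 records needs 3 list pages, so `min_route53_calls(250)` gives a range of 5 to 7 calls per hosted cluster.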
==== This Jira covers only haproxy component ====
Description of problem:
Pods running in the namespace openshift-vsphere-infra are too verbose: they print as INFO messages that should be DEBUG. This excess of verbosity has an impact on CRI-O, on the node, and on the logging system. For instance, with 71 nodes, the number of log entries coming from this namespace in 1 month was 450,000,000, meaning 1 TB of logs written to disk on the node by CRI-O, read by the Red Hat log collector, and stored in the Log Store. In addition to the impact on performance, this has a financial impact because of the storage needed. Examples of log messages that fit better at DEBUG than at INFO:
```
/// For keepalived pods, 4 messages are printed per node every 10 seconds. In this example the number of nodes is 71, which means 284 log entries every 10 seconds, or 1704 log entries per minute per keepalived pod
$ oc logs keepalived-master.example-0 -c keepalived-monitor |grep master.example-0|grep 2024-02-15T08:20:21 |wc -l
$ oc logs keepalived-master-example-0 -c keepalived-monitor |grep worker-example-0|grep 2024-02-15T08:20:21
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.671390814Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
2024-02-15T08:20:21.733399279Z time="2024-02-15T08:20:21Z" level=info msg="Searching for Node IP of worker-example-0. Using 'x.x.x.x' as machine network. Filtering out VIPs '[x.x.x.x x.x.x.x]'."
2024-02-15T08:20:21.733421398Z time="2024-02-15T08:20:21Z" level=info msg="For node worker-example-0 selected peer address x.x.x.x using NodeInternalIP"
/// For haproxy, 2 log messages are printed every 6 seconds for each master, which means 6 messages in the same second, or 60 messages per minute per pod
$ oc logs haproxy-master-0-example -c haproxy-monitor
...
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="Searching for Node IP of master-example-0. Using 'x.x.x.x/24' as machine network. Filtering out VIPs '[x.x.x.x]'."
2024-02-15T08:20:00.517159455Z time="2024-02-15T08:20:00Z" level=info msg="For node master-example-0 selected peer address x.x.x.x using NodeInternalIP"
```
Version-Release number of selected component (if applicable):
OpenShift 4.14 VSphere IPI installation
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift 4.14 vSphere IPI environment
2. Review the logs of the haproxy pods and keepalived pods running in the namespace `openshift-vsphere-infra`
Actual results:
The haproxy-* and keepalived-* pods are too verbose: they print as INFO messages that should be DEBUG. Some of these messages are shown in the description of the problem in this bug.
Expected results:
Only relevant messages are printed as INFO, which reduces the verbosity of the pods running in the namespace `openshift-vsphere-infra`.
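The keepalived-monitor rate quoted in the description (4 messages per node every 10 seconds, with 71 nodes) can be checked with a short calculation. This is only a worked example of that arithmetic in Python; `entries_per_minute` is a hypothetical helper, and the per-day figure is an extrapolation that assumes the rate stays constant.

```python
def entries_per_minute(messages_per_node: int, nodes: int, interval_seconds: int) -> int:
    """Log entries one monitor container writes per minute."""
    cycles_per_minute = 60 // interval_seconds
    return messages_per_node * nodes * cycles_per_minute


# Figures taken from the description: 4 messages per node, 71 nodes,
# one cycle every 10 seconds.
keepalived_per_minute = entries_per_minute(4, 71, 10)
# Extrapolation, assuming a constant rate over 24 hours.
keepalived_per_day = keepalived_per_minute * 60 * 24
print(keepalived_per_minute, keepalived_per_day)
```

This gives 1704 entries per minute for a single keepalived-monitor container, or roughly 2.45 million entries per day before counting the other pods in the namespace.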
Additional info:
Please review the following PR: https://github.com/openshift/azure-workload-identity/pull/6
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
When we load the AgentClusterInstall manifest from disk, we sometimes make changes to it.
For example, after the fix for OCPBUGS-7495 we rewrite any lowercase platform name to mixed case, because for a while we required lowercase even when mixed case is correct.
In 4.14, we set the userManagedNetworking to true when platform:none is used, even if the user didn't specify it in the ZTP manifests, because the controller in ZTP similarly defaults it.
However, these changes aren't taking effect, because they aren't passed through to the manifest that is included in the Agent ISO.
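The two adjustments described above, and the point that the adjusted object is the one that has to be written into the ISO, can be sketched as follows. This is a Python illustration under stated assumptions rather than the installer's Go code: the field paths `spec.platformType` and `spec.networking.userManagedNetworking`, the `CANONICAL_PLATFORM` mapping, and the function name are all assumptions made for the example.

```python
import copy

# Lowercase-to-mixed-case platform names; the exact set is an assumption.
CANONICAL_PLATFORM = {"baremetal": "BareMetal", "vsphere": "VSphere", "none": "None"}


def normalize_agent_cluster_install(manifest: dict) -> dict:
    """Return a normalized copy of an AgentClusterInstall-like mapping.

    The caller must serialize the returned copy; writing out the
    original input would silently drop these changes.
    """
    result = copy.deepcopy(manifest)
    spec = result.setdefault("spec", {})
    platform = spec.get("platformType")
    if isinstance(platform, str):
        # Rewrite a lowercase platform name to its mixed-case form.
        spec["platformType"] = CANONICAL_PLATFORM.get(platform.lower(), platform)
    if spec.get("platformType") == "None":
        networking = spec.setdefault("networking", {})
        # Default only when the user left the field unset.
        networking.setdefault("userManagedNetworking", True)
    return result
```

Returning a copy instead of mutating the input makes the failure mode in this report explicit: the input keeps its on-disk contents, so whichever object is embedded in the Agent ISO determines whether the changes take effect.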
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/97
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/operator-framework-olm/pull/621
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
This was discovered during Contrail testing when a large number of additional manifests specific to Contrail were added to the openshift/ dir. The additional manifests are here - https://github.com/Juniper/contrail-networking/tree/main/releases/23.1/ocp. When creating the agent image the following error occurred:
failed to fetch Agent Installer ISO: failed to generate asset \"Agent Installer ISO\": failed to create overwrite reader for ignition: content length (802204) exceeds embed area size (262144)"]
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Multi-egress source route entries do not get properly updated with adminpolicybasedexternalroutes CR
Version-Release number of selected component (if applicable):
Upstream ovn-kubernetes commit c60963123d28075288a8c23d2796c2df89f54601
How reproducible (100%):
Create a served/application pod after creating the adminpolicybasedexternalroutes CR. The corresponding source route entries won't be added to the worker routing table.
Steps to Reproduce:
1. Create an ovn-kubernetes kind cluster:
./kind.sh --install-cni-plugins --disable-snat-multiple-gws --multi-network-enable
2. Create two namespaces:
$ cat <<EOF | kubectl apply -f -
---
apiVersion: v1
kind: Namespace
metadata:
  name: frr
  labels:
    gws: "true"
spec: {}
---
apiVersion: v1
kind: Namespace
metadata:
  name: bar
  labels:
    multiple_gws: "true"
spec: {}
EOF
3. Create a network attachment definition:
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: internal-net
  namespace: frr
spec:
  config: |-
    {
      "cniVersion": "0.3.1",
      "name": "internal-net",
      "plugins": [
        {
          "type": "macvlan",
          "master": "breth0",
          "mode": "bridge",
          "ipam": {
            "type": "static"
          }
        },
        {
          "capabilities": {
            "mac": true,
            "ips": true
          },
          "type": "tuning"
        }
      ]
    }
EOF
4. Create the first dummy pod:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dummy1
  namespace: bar
spec:
  containers:
  - name: dummy
    image: centos
    command:
      - sleep
      - infinity
  nodeSelector:
    kubernetes.io/hostname: ovn-worker2
EOF
5. Create the AdminPolicyBasedExternalRoute CR:
$ cat <<EOF | kubectl apply -f -
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: honeypotting
spec:
  ## gateway example
  from:
    namespaceSelector:
      matchLabels:
        multiple_gws: "true"
  nextHops:
    dynamic:
      - podSelector:
          matchLabels:
            gw: "true"
        bfdEnabled: true
        namespaceSelector:
          matchLabels:
            gws: "true"
        networkAttachmentName: frr/internal-net
EOF
6. Create the lb pod:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ext-gw
  labels:
    gw: "true"
  namespace: frr
  annotations:
    k8s.v1.cni.cncf.io/networks: '[
      {
        "name": "internal-net",
        "ips": [ "172.18.0.10/16" ]
      }
    ]'
spec:
  containers:
  - name: frr
    image: centos
    command:
      - sleep
      - infinity
    securityContext:
      privileged: true
  nodeSelector:
    kubernetes.io/hostname: ovn-worker
EOF
7. Create a second dummy pod:
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: dummy2
  namespace: bar
spec:
  containers:
  - name: dummy
    image: centos
    command:
      - sleep
      - infinity
  nodeSelector:
    kubernetes.io/hostname: ovn-worker2
EOF
Actual results:
Only source route entries for the first dummy pod were created:
$ kubectl get po -o wide -n bar
dummy1 Running 10.244.1.3
dummy2 Running 10.244.1.4
$ POD=$(kubectl get pod -n ovn-kubernetes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' | grep ovnkube-db-) ; kubectl exec -ti $POD -n ovn-kubernetes -c nb-ovsdb -- bash
[root@ovn-control-plane ~]# ovn-nbctl lr-route-list GR_ovn-worker2
IPv4 Routes
Route Table <main>:
10.244.1.3 172.18.0.10 src-ip exgw-rtoe-GR_ovn-worker2 ecmp-symmetric-reply bfd
169.254.169.0/29 169.254.169.4 dst-ip rtoe-GR_ovn-worker2
10.244.0.0/16 100.64.0.1 dst-ip
0.0.0.0/0 172.18.0.1 dst-ip rtoe-GR_ovn-worker
Expected results:
Source route entries for both dummy pods created:
[root@ovn-control-plane ~]# ovn-nbctl lr-route-list GR_ovn-worker2
IPv4 Routes
Route Table <main>:
10.244.1.3 172.18.0.10 src-ip exgw-rtoe-GR_ovn-worker2 ecmp-symmetric-reply bfd
10.244.1.4 172.18.0.10 src-ip exgw-rtoe-GR_ovn-worker2 ecmp-symmetric-reply bfd
169.254.169.0/29 169.254.169.4 dst-ip rtoe-GR_ovn-worker2
10.244.0.0/16 100.64.0.1 dst-ip
0.0.0.0/0 172.18.0.1 dst-ip rtoe-GR_ovn-worke
Additional info:
$ kubectl describe adminpolicybasedexternalroutes
...
Status:
  Last Transition Time: 2023-09-25T09:50:25Z
  Messages:
    Configured external gateway IPs: 172.18.0.10
  Status: Success
Events: <none>
Description of problem:
The links for the CodeEditor component return 404. Check the links for the options and ref parameters: https://github.com/openshift/console/blob/master/frontend/packages/console-dynamic-plugin-sdk/docs/api.md#codeeditor
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
404
Expected results:
200
Additional info:
Description of the problem:
A non-lowercase hostname provided by DHCP breaks assisted installation.
How reproducible:
100%
Steps to reproduce:
Actual results:
bootkube fails
Expected results:
bootkube should succeed
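A minimal sketch of the normalization this report implies, assuming the remedy is to lowercase the DHCP-provided hostname before it is used as a node name. The helper name `normalize_hostname` is hypothetical and is not taken from the assisted-installer code; the pattern follows the RFC 1123 label rules (lowercase alphanumerics and hyphens).

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', starting and ending
# with an alphanumeric, at most 63 characters.
_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")


def normalize_hostname(raw: str) -> str:
    """Lowercase a hostname and validate each label (hypothetical helper)."""
    host = raw.strip().rstrip(".").lower()
    if not host or len(host) > 253:
        raise ValueError(f"invalid hostname length: {raw!r}")
    for label in host.split("."):
        if not _LABEL.match(label):
            raise ValueError(f"invalid hostname label {label!r} in {raw!r}")
    return host
```

With this helper, a mixed-case name such as `Master-0.Example.COM` is accepted and lowercased, while a name containing characters outside the allowed set raises an error.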
Please review the following PR: https://github.com/openshift/ibm-vpc-node-label-updater/pull/25
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When installing OpenShift on GCP in a Shared VPC (formerly XPN) configuration, the service account used must have permissions to create firewall rules on the host project's network in order to proceed. If the account does not have permissions, the installation will fail but the explicit reason is not listed.
Version-Release number of selected component (if applicable):
4.14-ec.1
How reproducible:
100% of the time when the service account creating the cluster does not have Owner permissions or `compute.firewalls.create` on the host project.
Steps to Reproduce:
1. Follow the instructions at https://docs.openshift.com/container-platform/4.13/installing/installing_gcp/installing-gcp-shared-vpc.html
2. As part of the prerequisites, make a service account with the permissions listed at https://docs.openshift.com/container-platform/4.13/installing/installing_gcp/installing-gcp-account.html#minimum-required-permissions-ipi-gcp-xpn
3. Create a cluster using an install-config.yaml similar to the one attached.
Actual results:
The cluster fails to bootstrap. The bootstrap node will be present, as will the masters, but components will not be able to reach the api-int load balancer.
Expected results:
The log files would include an error message regarding the missing permissions, and possibly abort the installation early.
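The early check asked for here could look roughly like the sketch below. It is not the installer's implementation: the function names are hypothetical, and the set of granted permissions is passed in directly, whereas a real check would have to obtain it from an IAM query. Only `compute.firewalls.create` comes from this report; the second required permission is illustrative.

```python
def missing_permissions(required, granted):
    """Return the required permissions that are absent from the granted set."""
    return sorted(set(required) - set(granted))


def preflight(required, granted):
    """Fail early with an explicit message instead of a silent bootstrap failure."""
    missing = missing_permissions(required, granted)
    if missing:
        raise PermissionError(
            "service account is missing permissions on the host project: "
            + ", ".join(missing)
        )


# Illustrative inputs; only compute.firewalls.create is named in the report.
REQUIRED = ["compute.firewalls.create", "compute.firewalls.delete"]
```

Calling `preflight(REQUIRED, granted)` before any resources are created would surface the missing permission by name in the log.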
Additional info:
https://docs.openshift.com/container-platform/4.13/installing/installing_gcp/installing-gcp-account.html#minimum-required-permissions-ipi-gcp-xpn does not list the `compute.firewalls.create` permission, which is included in the code at https://github.com/openshift/installer/blob/4f59664588c4472b7aba2838159651e729908dff/pkg/asset/cluster/tfvars.go#L79. This is probably also a related docs improvement.
File attachment seems to have been disabled, so here is the text of the `install-config.yaml` that I was using:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: installer.gcp.devcluster.openshift.com
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
credentialsMode: Passthrough
featureSet: TechPreviewNoUpgrade
metadata:
  creationTimestamp: null
  name: nrbxpn
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: openshift-installer-shared-vpc
    region: us-central1
    network: installer-shared-vpc
    computeSubnet: installer-shared-vpc-subnet-1
    controlPlaneSubnet: installer-shared-vpc-subnet-2
    networkProjectID: openshift-dev-installer
publish: Internal
pullSecret: <omitted>
sshKey: <omitted>
SB and NB containers have this command to expose their DB via SSL and set the inactivity probe interval. With OVN-IC we don't use SSL for the DBs anymore, so we can remove that bit.
if ! retry 60 "inactivity-probe" "ovn-sbctl --no-leader-only -t 5 set-connection pssl:.OVN_SB_PORT.LISTEN_DUAL_STACK -- set connection . inactivity_probe=.OVN_CONTROLLER_INACTIVITY_PROBE"; then
should become:
if ! retry 60 "inactivity-probe" "ovn-sbctl --no-leader-only -t 5 set connection . inactivity_probe=.OVN_CONTROLLER_INACTIVITY_PROBE"; then
Also, we can clean up the comment at the end where it polls the IPsec status, which is just a way of making sure the DB is ready and answering queries. We don't need to wait for the cluster to converge (since there is no RAFT), but we could change the comment to:
"Kill some time while DB becomes ready by checking IPsec status"
Version: 4.11.0-0.nightly-2022-06-22-015220
$ openshift-install version
openshift-install 4.11.0-0.nightly-2022-06-22-015220
built from commit f912534f12491721e3874e2bf64f7fa8d44aa7f5
release image registry.ci.openshift.org/ocp/release@sha256:9c2e9cafaaf48464a0d27652088d8fb3b2336008a615868aadf8223202bdc082
release architecture amd64
Platform: OSP 16.1.8 with manila service
Please specify:
What happened?
In a fresh 4.11 cluster (with Kuryr, but that shouldn't be related to the issue), there are no endpoints
for manila metrics:
> $ oc -n openshift-manila-csi-driver get endpoints
NAME ENDPOINTS AGE
manila-csi-driver-controller-metrics <none> 3h7m
> $ oc -n openshift-manila-csi-driver describe endpoints
Name: manila-csi-driver-controller-metrics
Namespace: openshift-manila-csi-driver
Labels: app=manila-csi-driver-controller-metrics
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2022-06-22T10:30:06Z
Subsets:
Events: <none>
> $ oc -n openshift-manila-csi-driver get all
NAME READY STATUS RESTARTS AGE
pod/csi-nodeplugin-nfsplugin-4mqgx 1/1 Running 0 3h7m
pod/csi-nodeplugin-nfsplugin-555ns 1/1 Running 0 3h2m
pod/csi-nodeplugin-nfsplugin-bn26j 1/1 Running 0 3h7m
pod/csi-nodeplugin-nfsplugin-lfsm7 1/1 Running 0 3h1m
pod/csi-nodeplugin-nfsplugin-xwxnz 1/1 Running 0 3h1m
pod/csi-nodeplugin-nfsplugin-zqnkt 1/1 Running 0 3h7m
pod/openstack-manila-csi-controllerplugin-7fc4b4f56d-ddn25 6/6 Running 2 (158m ago) 3h7m
pod/openstack-manila-csi-controllerplugin-7fc4b4f56d-p9jss 6/6 Running 0 3h6m
pod/openstack-manila-csi-nodeplugin-6w426 2/2 Running 0 3h2m
pod/openstack-manila-csi-nodeplugin-fvsjr 2/2 Running 0 3h7m
pod/openstack-manila-csi-nodeplugin-g9x4t 2/2 Running 0 3h1m
pod/openstack-manila-csi-nodeplugin-gp76x 2/2 Running 0 3h7m
pod/openstack-manila-csi-nodeplugin-n9v9t 2/2 Running 0 3h7m
pod/openstack-manila-csi-nodeplugin-s6srv 2/2 Running 0 3h1m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/manila-csi-driver-controller-metrics ClusterIP 172.30.118.232 <none> 443/TCP,444/TCP 3h7m
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/csi-nodeplugin-nfsplugin 6 6 6 6 6 <none> 3h7m
daemonset.apps/openstack-manila-csi-nodeplugin 6 6 6 6 6 <none> 3h7m
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/openstack-manila-csi-controllerplugin 2/2 2 2 3h7m
NAME DESIRED CURRENT READY AGE
replicaset.apps/openstack-manila-csi-controllerplugin-5697ccfcbf 0 0 0 3h7m
replicaset.apps/openstack-manila-csi-controllerplugin-7fc4b4f56d 2 2 2 3h7m
This can lead to not being able to retrieve manila metrics.
openshift_install.log: http://rhos-ci-logs.lab.eng.tlv2.redhat.com/logs/j2pg/DFG-osasinfra-shiftstack_periodic_subjob-ocp_install-4.11-kuryr-ipi/15/undercloud-0/home/stack/ostest/.openshift_install.log.gz
cinder-csi for example is configured with such endpoints:
> $ oc -n openshift-cluster-csi-drivers get endpoints
NAME ENDPOINTS AGE
openstack-cinder-csi-driver-controller-metrics 10.196.1.100:9203,10.196.2.82:9203,10.196.1.100:9205 + 5 more... 3h15m
> $ oc -n openshift-cluster-csi-drivers describe endpoints
Name: openstack-cinder-csi-driver-controller-metrics
Namespace: openshift-cluster-csi-drivers
Labels: app=openstack-cinder-csi-driver-controller-metrics
Annotations: endpoints.kubernetes.io/last-change-trigger-time: 2022-06-22T10:58:57Z
Subsets:
Addresses: 10.196.1.100,10.196.2.82
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
attacher-m 9203 TCP
snapshotter-m 9205 TCP
provisioner-m 9202 TCP
resizer-m 9204 TCP
Events: <none>
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/643
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
An invalid EgressIP object caused the ovnkube-node pods to enter CrashLoopBackOff (CLBO).
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-05-195247
How reproducible:
Always
Steps to Reproduce:
1. Label one node as an egress node.
2. Create an EgressIP object with an empty label key and value:
oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2023-10-07T09:08:28Z"
    generation: 2
    name: egressip-test
    resourceVersion: "122021"
    uid: 23445450-37d5-4ec3-b8fe-d8352a19e703
  spec:
    egressIPs:
    - 10.0.70.100
    namespaceSelector:
      matchLabels:
        "": ""
    podSelector:
      matchLabels:
        "": ""
  status:
    items:
    - egressIP: 10.0.70.100
      node: ip-10-0-70-135
kind: List
metadata:
  resourceVersion: ""
3. Create a namespace and test pods.
Actual results:
Test pods were stuck in ContainerCreating status:
% oc get pods -n hrw
NAME READY STATUS RESTARTS AGE
test-rc-hwmns 0/1 ContainerCreating 0 45s
test-rc-p9kl8 0/1 ContainerCreating 0 45s
% oc describe pod test-rc-hwmns -n hrw
Name: test-rc-hwmns
Namespace: hrw
Priority: 0
Service Account: default
Node: ip-10-0-70-125/10.0.70.125
Start Time: Sat, 07 Oct 2023 17:08:50 +0800
Labels: name=test-pods
Annotations: k8s.ovn.org/pod-networks: {"default":{"ip_addresses":["10.129.2.11/23"],"mac_address":"0a:58:0a:81:02:0b","gateway_ips":["10.129.2.1"],"routes":[{"dest":"10.128.0.0...
             openshift.io/scc: restricted-v2
             seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicationController/test-rc
Containers:
  test-pod:
    Container ID:
    Image: quay.io/openshifttest/hello-sdn@sha256:c89445416459e7adea9a5a416b3365ed3d74f2491beb904d61dc8d1eb89a72a4
    Image ID:
    Port: <none>
    Host Port: <none>
    State: Waiting
      Reason: ContainerCreating
    Ready: False
    Restart Count: 0
    Limits:
      memory: 340Mi
    Requests:
      memory: 340Mi
    Environment: <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7vlz8 (ro)
Conditions:
  Type Status
  Initialized True
  Ready False
  ContainersReady False
  PodScheduled True
Volumes:
  kube-api-access-7vlz8:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional: <nil>
    DownwardAPI: true
    ConfigMapName: openshift-service-ca.crt
    ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 59s default-scheduler Successfully assigned hrw/test-rc-hwmns to ip-10-0-70-125
  Warning FailedCreatePodSandBox 59s kubelet Failed to create pod sandbox: rpc
error: code = Unknown desc = failed to create pod network sandbox k8s_test-rc-hwmns_hrw_d72a4216-b94b-4034-a9f7-526758055994_0(1ad74472b9e985cee4a3081f5912b3d4553351d14764d3bfece1d174146f90ca): error adding pod hrw_test-rc-hwmns to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:1ad74472b9e985cee4a3081f5912b3d4553351d14764d3bfece1d174146f90ca Netns:/var/run/netns/131f3670-1a49-4088-9002-5624a3acc6d3 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=hrw;K8S_POD_NAME=test-rc-hwmns;K8S_POD_INFRA_CONTAINER_ID=1ad74472b9e985cee4a3081f5912b3d4553351d14764d3bfece1d174146f90ca;K8S_POD_UID=d72a4216-b94b-4034-a9f7-526758055994 Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 
108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 104 111 115 116 114 111 111 116 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"1ad74472b9e985cee4a3081f5912b3d4553351d14764d3bfece1d174146f90ca" Netns:"/var/run/netns/131f3670-1a49-4088-9002-5624a3acc6d3" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=hrw;K8S_POD_NAME=test-rc-hwmns;K8S_POD_INFRA_CONTAINER_ID=1ad74472b9e985cee4a3081f5912b3d4553351d14764d3bfece1d174146f90ca;K8S_POD_UID=d72a4216-b94b-4034-a9f7-526758055994" Path:"" ERRORED: error configuring pod [hrw/test-rc-hwmns] networking: [hrw/test-rc-hwmns/d72a4216-b94b-4034-a9f7-526758055994:ovn-kubernetes]: error adding container to network "ovn-kubernetes": failed to send CNI request: Post "http://dummy/": dial unix /var/run/ovn-kubernetes/cni//ovn-cni-server.sock: connect: connection refused ' % oc get pods -n openshift-ovn-kubernetes NAME READY 
STATUS RESTARTS AGE
ovnkube-control-plane-85f96b444b-2bdwf 2/2 Running 0 5h27m
ovnkube-control-plane-85f96b444b-2mhfj 2/2 Running 0 5h27m
ovnkube-control-plane-85f96b444b-ddjhx 2/2 Running 0 5h27m
ovnkube-node-5fkb5 7/8 CrashLoopBackOff 6 (2m52s ago) 13m
ovnkube-node-p7qvr 7/8 CrashLoopBackOff 6 (2m56s ago) 13m
ovnkube-node-tzhlb 7/8 CrashLoopBackOff 6 (2m51s ago) 13m
ovnkube-node-x5849 7/8 CrashLoopBackOff 6 (2m57s ago) 13m
ovnkube-node-xscbr 7/8 CrashLoopBackOff 6 (2m35s ago) 13m
exec /usr/bin/ovnkube --init-ovnkube-controller "${K8S_NODE}" --init-node "${K8S_NODE}" \
  --config-file=/run/ovnkube-config/ovnkube.conf \
  --ovn-empty-lb-events \
  --loglevel "${OVN_KUBE_LOG_LEVEL}" \
  --inactivity-probe="${OVN_CONTROLLER_INACTIVITY_PROBE}" \
  ${gateway_mode_flags} \
  ${node_mgmt_port_netdev_flags} \
  --metrics-bind-address "127.0.0.1:29103" \
  --ovn-metrics-bind-address "127.0.0.1:29105" \
  --metrics-enable-pprof \
  --metrics-enable-config-duration \
  --export-ovs-metrics \
  --disable-snat-multiple-gws \
  ${export_network_flows_flags} \
  ${multi_network_enabled_flag} \
  ${multi_network_policy_enabled_flag} \
  ${admin_network_policy_enabled_flag} \
  --enable-multicast \
  --zone ${K8S_NODE} \
  --enable-interconnect \
  --acl-logging-rate-limit "20" \
  ${gw_interface_flag} \
  --enable-multi-external-gateway=true \
  ${ip_forwarding_flag} \
  ${NETWORK_NODE_IDENTITY_ENABLE}
State: Waiting
  Reason: CrashLoopBackOff
Last State: Terminated
  Reason: Error
  Message: vn-kubernetes/go-controller/pkg/retry.(*RetryFramework).WatchResourceFiltered.func1.1({0xc0007cb368, 0x11})
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/retry/obj_retry.go:531 +0x2c7
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).DoWithLock(0xc000d4eb40, {0xc0007cb368, 0x11}, 0xc000e43dd0)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/retry/obj_retry.go:137 +0xce
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/retry.(*RetryFramework).WatchResourceFiltered.func1({0x22eede0, 0xc000c6fec0})
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/retry/obj_retry.go:504 +0x265
  k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:243
  k8s.io/client-go/tools/cache.FilteringResourceEventHandler.OnAdd({0xc00111bdc0?, {0x26d0aa0?, 0xc001580570?}}, {0x22eede0, 0xc000c6fec0}, 0xa0?)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/vendor/k8s.io/client-go/tools/cache/controller.go:306 +0x6e
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.(*Handler).OnAdd(...)
    /go/src/github.com/openshift/ovn-kubernetes/go-controller/pkg/factory/handler.go:52
  github.com/ovn-org/ovn-kubernetes/go-controller/pkg/factory.newQueuedInformer.func1.1(0xc000e43da0?)
  Exit Code: 2
  Started: Sat, 07 Oct 2023 17:14:38 +0800
  Finished: Sat, 07 Oct 2023 17:14:39 +0800
Ready: False
Restart Count: 6
Requests:
  cpu: 10m
  memory: 600Mi
Expected results:
Add validation for labels: warn that the label key must not be empty, and reject the object instead of applying it.
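The kind of check requested here can be sketched as follows. This is an illustration only, not the ovn-kubernetes validation code: the function names are hypothetical, and the spec is modeled as a plain dictionary shaped like the EgressIP object in the reproduction steps.

```python
def validate_match_labels(match_labels, path):
    """Return one error per empty label key found under the given field path."""
    return [
        f"{path}: label key must not be empty"
        for key in match_labels
        if key == ""
    ]


def validate_egressip_spec(spec):
    """Check both selectors of an EgressIP spec (hypothetical helper)."""
    errors = []
    for field in ("namespaceSelector", "podSelector"):
        labels = (spec.get(field) or {}).get("matchLabels") or {}
        errors += validate_match_labels(labels, f"spec.{field}.matchLabels")
    return errors


# The spec from the reproduction steps, with empty keys in both selectors.
bad_spec = {
    "egressIPs": ["10.0.70.100"],
    "namespaceSelector": {"matchLabels": {"": ""}},
    "podSelector": {"matchLabels": {"": ""}},
}
```

Running `validate_egressip_spec(bad_spec)` yields one error per selector, which an admission step could return to the user instead of accepting the object.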
Additional info:
Description of problem:
Some log entries are duplicated because addOrUpdateSubnet is called twice, which is misleading.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Start it up.
2. Check the logs.
Actual results:
Expected results:
Additional info:
Description of problem:
A customer reported that keepalived pods crash and fail to start on worker nodes (Ingress VIP). The expectation is that the keepalived pod (labeled app=kni-infra-vrrp) should start. This affects everyone using OCP v4.13 together with an Ingress VIP and could be a bug in the nodeip-configuration service in v4.13.
More details as below:
-> There are 2 problems in OCP v4.13: the regular expression does not match, and the chroot command fails because of missing ldd libraries inside the container. This has been fixed in 4.14, but not in 4.13.
-> The nodeip-configuration service creates the /run/nodeip-configuration/remote-worker file based on onPremPlatformAPIServerInternalIPs (apiVIP) and ignores onPremPlatformIngressIPs (ingressVIP), as can be seen in the source code.
-> The keepalived process then does not start because the remote-worker file exists.
-> The liveness probes fail because the keepalived process does not exist.
The fix is quite simple (as highlighted by the customer): the nodeip-configuration.service template needs to be extended to consider the Ingress VIPs as well. This is the source code where the changes need to be made.
As the following code snippet shows, node-ip ranges only over onPremPlatformAPIServerInternalIPs and ignores onPremPlatformIngressIPs.
node-ip \
set \
--platform {{ .Infra.Status.PlatformStatus.Type }} \
{{if not (isOpenShiftManagedDefaultLB .) -}}
--user-managed-lb \
{{end -}}
{{if or (eq .IPFamilies "IPv6") (eq .IPFamilies "DualStackIPv6Primary") -}}
--prefer-ipv6 \
{{end -}}
--retry-on-failure \
{{ range onPremPlatformAPIServerInternalIPs . }}{{.}} {{end}}; \
do \
sleep 5; \
done"
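The intent of the proposed change, passing the Ingress VIPs to node-ip alongside the API VIPs, can be illustrated with the sketch below. It only models how the argument list would be built; the function name is hypothetical, and whether the actual fix combines the lists this way is an assumption.

```python
def node_ip_vips(api_vips, ingress_vips):
    """Return the API VIPs followed by the Ingress VIPs, without duplicates."""
    combined = []
    for vip in list(api_vips) + list(ingress_vips):
        if vip not in combined:
            combined.append(vip)
    return combined


# Example addresses from the RFC 5737 documentation range, not from the report.
args = " ".join(node_ip_vips(["192.0.2.10"], ["192.0.2.11"]))
```

With both lists included, a node that sits in the Ingress VIP subnet but not the API VIP subnet would no longer be evaluated against the API VIPs alone.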
The difference between OCP v4.12 and v4.13 with respect to the keepalived pod is also shown in the attached image.
Version-Release number of selected component (if applicable):
v4.13
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
The keepalived pods crash and fail to start on worker nodes (Ingress VIP).
Expected results:
The expectation is that the keepalived pod (labeled by app=kni-infra-vrrp) should start.
Additional info:
Description of problem:
With OCPBUGS-18274 we had to update the etcdctl binary. Unfortunately, the script does not attempt to update the binary if it is already found in the path: https://github.com/openshift/cluster-etcd-operator/blob/master/bindata/etcd/etcd-common-tools#L16-L24 This causes confusion, because the binary might not be the latest one that we ship with etcd. Pulling the binary should not be a big deal: etcd is running locally anyway, and the local image should already be cached. We should always replace the binary.
Version-Release number of selected component (if applicable):
any currently supported release
How reproducible:
always
Steps to Reproduce:
1. Run cluster-backup.sh to download the binary.
2. Update the etcd image (for example, to a different version).
3. Run cluster-backup.sh again.
Actual results:
cluster-backup.sh will simply print "etcdctl is already installed"
Expected results:
etcdctl should always be pulled
Additional info:
Description of problem:
The shutdown-delay-duration argument for the openshift-apiserver is set to 3s in hypershift, but set to 15s in core openshift. Hypershift should update the value to match.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Diff the openshift-apiserver configs
Actual results:
https://github.com/openshift/hypershift/blob/3a42e77041535c8ac8012856d279bc782efcaf3c/control-plane-operator/controllers/hostedcontrolplane/oapi/config.go#L59C1-L60C1
Expected results:
https://github.com/openshift/cluster-openshift-apiserver-operator/commit/cad9746b62abf3b3230592d45f7f60bcecc96dac
Additional info:
Description of problem:
I'm seeing Prometheus disruption failures in upgrade tests
Version-Release number of selected component (if applicable):
How reproducible:
Sporadically
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/144
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Recently we bumped the hyperkube image [1] to use both RHEL 9 builder and base images. In order to keep things consistent, we tried to do the same with the "pause" image [2], however, that caused mass failures in payload jobs [3] due to a mismatch with ART [4], which still builds that image with RHEL 8. As a result, we decided to keep builder & base images for "pause" in RHEL 8, as this work was not required for the kube 1.28 bump nor the FIPS issue we were addressing. However, for the sake of consistency, eventually it'd be good to bump the "pause" builder & base images to RHEL 9.
[1] https://github.com/openshift/kubernetes/blob/6ab54b8d9a0ea02856efd3835b6f9df5da9ce115/openshift-hack/images/hyperkube/Dockerfile.rhel#L1
[2] https://github.com/openshift/kubernetes/blob/6ab54b8d9a0ea02856efd3835b6f9df5da9ce115/build/pause/Dockerfile.Rhel#L1
[3] https://github.com/openshift/kubernetes/blob/6ab54b8d9a0ea02856efd3835b6f9df5da9ce115/build/pause/Dockerfile.Rhel#L1
[4] https://github.com/openshift-eng/ocp-build-data/blob/openshift-4.15/images/openshift-enterprise-pod.yml
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
Builder & base images for "pause" are RHEL 8.
Expected results:
Builder & base images for "pause" are RHEL 9.
Additional info:
Description of problem:
When the "start build with broken proxy should start a build and wait for the build to fail [apigroup:build.openshift.io]" test runs, it expects the build to exit with a failure before printing the text "clone" for its log. Part of attempting to add a variant of this test which exercises the same functionality using an unprivileged build involves turning up the logging level so that the builder will log information that the test can look for which confirms that it was run in an unprivileged mode. I'd like for it to print the name under which it was invoked, so that it's easier to find where a particular container's output starts in the log, but that name is openshift-git-clone. The log message which would indicate that the test failed includes the text "git clone", so I'd like to amend the test to fail when that text is found in the log instead.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Modify the test to increase the logging level for its test build.
2. Apply https://github.com/openshift/builder/pull/358 to the builder image.
3. Run the test.
Actual results:
The test always fails (or "fails").
Expected results:
The test passes, unless we broke something somewhere.
Additional info:
Description of problem:
There is a regression in ovnkube-trace compatibility. On 4.13.6, the ovnkube-trace binary can be used on RHEL 8.6; the only issue is 'pip3 not available', the same as https://issues.redhat.com/browse/OCPBUGS-15914. On 4.13.7, the ovnkube-trace binary can no longer be used on RHEL 8.6 and fails with the following glibc errors:
./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./ovnkube-trace)
./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./ovnkube-trace)
Version-Release number of selected component (if applicable):
4.13.7
How reproducible:
always
Steps to Reproduce:
1. Install OCP 4.13.7.
2. Copy the ovnkube-trace binary from the ovnkube-master pod to the local machine:
$ POD=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-master -o name | head -1 | awk -F '/' '{print $NF}')
$ oc cp -n openshift-ovn-kubernetes $POD:/usr/bin/ovnkube-trace ovnkube-trace
Defaulted container "northd" out of: northd, nbdb, kube-rbac-proxy, sbdb, ovnkube-master, ovn-dbchecker
tar: Removing leading `/' from member names
$ chmod +x ovnkube-trace
$ ls -l ovnkube-trace
-rwxrwxr-x. 1 cloud-user cloud-user 45947136 Sep 14 03:10 ovnkube-trace
3. Run ovnkube-trace help:
$ ./ovnkube-trace -h
Actual results:
$ ./ovnkube-trace -h
./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./ovnkube-trace)
./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./ovnkube-trace)
Expected results:
ovnkube-trace can be used on RHEL8.6
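To make the incompatibility concrete, the sketch below extracts the glibc versions named in the loader errors and compares the highest one with the host's glibc version. The function names are hypothetical, and the host version (2.28, which is what RHEL 8 ships) is an assumption supplied by the caller rather than something the code detects.

```python
import re

# Matches loader errors of the form: version `GLIBC_2.34' not found
_GLIBC = re.compile(r"version `GLIBC_(\d+)\.(\d+)' not found")


def required_glibc(stderr: str):
    """Highest GLIBC (major, minor) named in the errors, or None if there are none."""
    versions = [(int(major), int(minor)) for major, minor in _GLIBC.findall(stderr)]
    return max(versions) if versions else None


def is_compatible(stderr: str, host_glibc) -> bool:
    """True when the host glibc satisfies every version the binary asked for."""
    needed = required_glibc(stderr)
    return needed is None or tuple(host_glibc) >= needed


errors = (
    "./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./ovnkube-trace)\n"
    "./ovnkube-trace: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./ovnkube-trace)\n"
)
```

For the errors in this report the binary needs at least glibc 2.34, so `is_compatible(errors, (2, 28))` is false, which is consistent with a binary built against a RHEL 9 toolchain.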
Additional info:
Description of problem:
The whereabouts reconciler is responsible for reclaiming dangling IPs and freeing them so that they can be allocated to new pods. This is crucial in scenarios where the number of addresses is limited and dangling IPs prevent whereabouts from allocating IPs to new pods. The reconciliation schedule is currently hard-coded to run once a day, with no user-friendly way to configure it.
Version-Release number of selected component (if applicable):
How reproducible:
Create a Whereabouts reconciler daemon set; the reconciler schedule cannot be configured.
Steps to Reproduce:
1. Create a Whereabouts reconciler daemonset, following the instructions at https://docs.openshift.com/container-platform/4.14/networking/multiple_networks/configuring-additional-network.html#nw-multus-creating-whereabouts-reconciler-daemon-set_configuring-additional-network
2. Run `oc get pods -n openshift-multus | grep whereabouts-reconciler`
3. Run `oc logs whereabouts-reconciler-xxxxx`
Actual results:
You can't configure the cron-schedule of the reconciler.
Expected results:
Be able to modify the reconciler cron schedule.
Additional info:
The fix for this bug is in two places: whereabouts and cluster-network-operator. For this reason, correct verification requires both fixed components. See below for details on how to apply the new configuration.
How to Verify:
Create a whereabouts-config ConfigMap with a custom value, and check in the whereabouts-reconciler pods' logs that it is updated, and triggering the clean up.
Steps to Verify:
1. Create a Whereabouts reconciler daemonset.
2. Wait for the whereabouts-reconciler pods to be running (the daemonset takes some time to be created).
3. Look for this message in the logs: "[error] could not read file: <nil>, using expression from flatfile: 30 4 * * *". It means the hard-coded default value is in use, because no ConfigMap exists yet.
4. Run: oc create configmap whereabouts-config -n openshift-multus --from-literal=reconciler_cron_expression="*/2 * * * *"
5. Check the logs for: "successfully updated CRON configuration"
6. Check that the reconciler runs within the next 2 minutes: "[verbose] starting reconciler run"
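The two cron expressions in these steps can be read with the small matcher sketched below. It is a simplified illustration, not the scheduler whereabouts uses: it supports only `*`, `*/n`, and plain integers, and it does not implement ranges, lists, or cron's special day-of-month/day-of-week handling.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field; only '*', '*/n', and a plain integer are supported."""
    if field == "*":
        return True
    if field.startswith("*/"):
        # Exact for minute and hour fields, whose ranges start at 0.
        return value % int(field[2:]) == 0
    return value == int(field)


def cron_matches(expr: str, when: datetime) -> bool:
    """True when the five-field expression fires at the given minute."""
    minute, hour, day, month, weekday = expr.split()
    # Cron counts Sunday as 0; isoweekday() counts it as 7.
    cron_weekday = when.isoweekday() % 7
    return (
        _field_matches(minute, when.minute)
        and _field_matches(hour, when.hour)
        and _field_matches(day, when.day)
        and _field_matches(month, when.month)
        and _field_matches(weekday, cron_weekday)
    )
```

Under this reading, the default `30 4 * * *` fires once a day at 04:30, while the custom `*/2 * * * *` fires on every even minute, which is why step 6 expects a reconciler run within 2 minutes.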
Description of problem:
GCP CCM should use granular permissions rather than predefined roles.
This is a clone of issue OCPBUGS-27959. The following is the description of the original issue:
—
In a CI run of etcd-operator-e2e I've found the following panic in the operator logs:
E0125 11:04:58.158222 1 health.go:135] health check for member (ip-10-0-85-12.us-west-2.compute.internal) failed: err(context deadline exceeded)
panic: send on closed channel
goroutine 15608 [running]:
github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth.func1()
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:58 +0xd2
created by github.com/openshift/cluster-etcd-operator/pkg/etcdcli.getMemberHealth
github.com/openshift/cluster-etcd-operator/pkg/etcdcli/health.go:54 +0x2a5
Unfortunately, the log file is incomplete. The operator recovered by restarting, but we should fix the panic nonetheless.
Job run for reference:
https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1186/pull-ci-openshift-cluster-etcd-operator-master-e2e-operator/1750466468031500288
Currently, when creating an Azure cluster, only the first node of the nodePool becomes ready and joins the cluster; all other Azure machines are stuck in the `Creating` state.
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/84
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/installer/pull/7496
Description of problem:
When deploying to a Power VS workspace created after February 14th, 2024, the installer cannot find the workspace.
Version-Release number of selected component (if applicable):
How reproducible:
Easily.
Steps to Reproduce:
1. Create a Power VS Workspace
2. Specify it in the install config
3. Attempt to deploy
4. Fail with "...is not a valid guid" error.
Actual results:
Failure to deploy to service instance
Expected results:
Should deploy to service instance
Additional info:
Description of problem:
$ oc get co machine-config
NAME             VERSION                         AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.16.0-0.ci-2024-03-01-110656   False       False         True       2m56s   Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]
MCO operator is failing with this error:
218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MachineConfigNodeFailed' Cluster not available for [{operator 4.16.0-0.ci-2024-03-01-110656}]: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-24-212.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]
I0301 17:19:12.823035 1 event.go:364] Event(v1.ObjectReference{Kind:"", Namespace:"openshift-machine-config-operator", Name:"machine-config", UID:"c1bad7e7-26ff-47fb-8a2d-a0c03c04d218", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'OperatorDegraded: MachineConfigNodeFailed' Failed to resync 4.16.0-0.ci-2024-03-01-110656 because: MachineConfigNode.machineconfiguration.openshift.io "ip-10-0-49-207.us-east-2.compute.internal" is invalid: [metadata.ownerReferences.apiVersion: Invalid value: "": version must not be empty, metadata.ownerReferences.kind: Invalid value: "": kind must not be empty]
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION                         AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.ci-2024-03-01-110656   True        False         17m     Error while reconciling 4.16.0-0.ci-2024-03-01-110656: the cluster operator machine-config is not available
How reproducible:
Always
Steps to Reproduce:
1. Enable TechPreview:
oc patch featuregate cluster --type=merge -p '{"spec":{"featureSet": "TechPreviewNoUpgrade"}}'
Actual results:
machine-config CO is degraded
Expected results:
machine-config CO should not be degraded, no error should happen in MCO operator pod
Additional info:
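The validation error in this report says that every entry in metadata.ownerReferences must carry a non-empty apiVersion and kind. The following Python sketch reproduces that check on a plain dictionary. It is not the API server's validation code: `find_empty_owner_fields` is a hypothetical helper, and the owner values in `good_metadata` (a `v1` `Node`) are assumptions chosen for illustration.

```python
def find_empty_owner_fields(metadata):
    """List ownerReferences fields that are missing or empty.

    Hypothetical helper: checks only apiVersion and kind, the two
    fields named in the error message.
    """
    problems = []
    for index, ref in enumerate(metadata.get("ownerReferences") or []):
        for field in ("apiVersion", "kind"):
            if not ref.get(field):
                problems.append(
                    f"metadata.ownerReferences[{index}].{field}: must not be empty"
                )
    return problems


node_name = "ip-10-0-24-212.us-east-2.compute.internal"

# Shape that triggers the error: both fields are empty strings.
bad_metadata = {
    "name": node_name,
    "ownerReferences": [{"apiVersion": "", "kind": "", "name": node_name}],
}

# Assumed corrected shape; the actual owner type is not stated in the report.
good_metadata = {
    "name": node_name,
    "ownerReferences": [{"apiVersion": "v1", "kind": "Node", "name": node_name}],
}

for problem in find_empty_owner_fields(bad_metadata):
    print(problem)
print(find_empty_owner_fields(good_metadata))
```

A check like this in a unit test could catch an object built without type information before it reaches the API server.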
Description of problem:
As currently deployed, CCM takes its kubeconfig from the environment it runs in, which is the Management Cluster (MC). As a result, it communicates with the Kubernetes API Server (KAS) of the MC instead of the KAS of the Hosted Cluster (HC) it is part of.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100%
Steps to Reproduce:
1. Deploy a hosted cluster
2. oc debug to the node running the HC CCM
3. crictl ps -a to list all the containers
4. crictl inspect X # Where X is the container id of the CCM container
5. nsenter -n -t pid_of_ccm_container
6. tcpdump
Actual results:
Communication goes to MC KAS
Expected results:
Communication goes to HC KAS
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-aws/pull/94
Description of problem:
YAML tab will crash on some specific browser versions when MCE is installed
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-21-153906
How reproducible:
Always
Steps to Reproduce:
1. Install multicluster engine for Kubernetes, create a MultiClusterEngine instance, and wait until the mce plugin is successfully enabled
2. Check the resources YAML tab, for example:
Deployment creation page /k8s/ns/yapei/deployments/~new/form
DeploymentConfig creation page /k8s/ns/yapei/deploymentconfigs/~new/form
ConfigMap creation YAML view page /k8s/ns/yapei/configmaps/~new/form
Route creation YAML view page /k8s/ns/yapei/routes/~new/form
BuildConfig creation YAML view page /k8s/ns/yapei/buildconfigs/~new/form
Actual results:
2. An error page is returned when visiting these pages: TypeError, Description: e is undefined
Expected results:
2. no error and page correctly loaded
Additional info:
Using metal-ipi on 4.14, the cluster fails to come up:
the network cluster operator fails to start, and the sdn pod shows the error
bash: RHEL_VERSION: unbound variable
Description of problem:
Since the golang.org/x/oauth2 package has been upgraded, GCP installs have been failing with
level=info msg=Credentials loaded from environment variable "GOOGLE_CLOUD_KEYFILE_JSON", file "/var/run/secrets/ci.openshift.io/cluster-profile/gce.json"
level=error msg=failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [platform.gcp.project: Internal error: failed to create cloud resource service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused, : Internal error: failed to create compute service: Get "http://169.254.169.254/computeMetadata/v1/universe/universe_domain": dial tcp 169.254.169.254:80: connect: connection refused]
Version-Release number of selected component (if applicable):
4.16/master
How reproducible:
always
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The bump has been introduced by https://github.com/openshift/installer/pull/8020
This will block the next payload; it failed 10 of 10 runs. The payload is https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-11-30-112918
Suspected PR:
https://github.com/openshift/machine-config-operator/pull/3965/files
OVN-IC doesn't use RAFT and doesn't need to wait a while for the cluster to converge. So we don't need the 90s delay for the readiness probe on the NB and SB containers anymore.
I think we only want to do this for multi-zone-interconnect though since the other deployment types would still use some RAFT.
This is a clone of issue OCPBUGS-25780. The following is the description of the original issue:
—
Description of problem:
When a new update is available for the cluster, clicking "Select a version" on the Cluster Settings page does nothing.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Prepare a cluster with an available update.
2. Go to the Cluster Settings page and choose a version by clicking the "Select a version" button.
Actual results:
2. Nothing happens when the button is clicked; the user cannot select a version from the page.
Expected results:
2. A modal should open so the user can select a version after clicking the "Select a version" button.
Additional info:
screenshot: https://drive.google.com/file/d/1Kpyu0kUKFEQczc5NVEcQFbf_uly_S60Y/view?usp=sharing
We are going to use request serving isolation mode in ROSA. We need an e2e test that keeps us from breaking that function as HyperShift development continues.
Description of problem:
It was seen in downstream and upstream that ovn-controller was constantly restarting. This was due to ovnkube-node telling it to exit after it thought that the encap IP (the primary node IP) had changed. This has been mitigated by: https://github.com/ovn-org/ovn-kubernetes/pull/3711
But we still need to know why the c.nodePrimaryAddrChanged() function is returning true when nothing is really changing on the node. Example after the fix above:
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 22:37:02.020612 1670 node_ip_handler_linux.go:212] Node primary address changed to 172.18.0.3. Updating OVN encap IP.
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 22:37:02.037852 1670 node_ip_handler_linux.go:343] Will not update encap IP, value: 172.18.0.3 is the already configured
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:03:03.115881 16698 node_ip_handler_linux.go:212] Node primary address changed to 172.18.0.3. Updating OVN encap IP.
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:03:03.122365 16698 node_ip_handler_linux.go:343] Will not update encap IP, value: 172.18.0.3 is the already configured
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:18:08.381694 27220 node_ip_handler_linux.go:212] Node primary address changed to 172.18.0.3. Updating OVN encap IP.
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:18:08.389655 27220 node_ip_handler_linux.go:343] Will not update encap IP, value: 172.18.0.3 is the already configured
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:19:26.638221 28746 node_ip_handler_linux.go:212] Node primary address changed to 172.18.0.3. Updating OVN encap IP.
ovn-control-plane/ovn-kubernetes/ovnkube.log:I0627 23:19:26.644217 28746 node_ip_handler_linux.go:343] Will not update encap IP, value: 172.18.0.3 is the already configured
This can be observed in kind deployments as well.
Version-Release number of selected component (if applicable):
Could affect versions earlier than 4.14
Please review the following PR: https://github.com/openshift/image-customization-controller/pull/99
This PR fixed a bug related to the nginx default assets directory in the Dockerfile. The fix was not backported to 4.15, which causes OCP consoles launched with CI images, such as those from cluster bot, to fail to display the Observe menu. Backporting fixes the issue for 4.15 CI images.
This is a clone of issue OCPBUGS-29088. The following is the description of the original issue:
—
Description of problem:
Customers have no way to revoke the break-glass signer certificate for HCP.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
always
Steps to Reproduce:
1. not possible
Actual results:
nothing
Expected results:
expected a path to do this
Additional info:
In order to use the new flow introduced to fix this, create a CertificateRevocationRequest in the namespace of a HostedControlPlane as described in the test:
Please review the following PR: https://github.com/openshift/sdn/pull/574
Description of problem:
The OCP upgrade is blocked because the csi-snapshot-controller cluster operator fails to start its deployment, with a fatal read-only filesystem message.
Version-Release number of selected component (if applicable):
Red Hat OpenShift 4.11 rhacs-operator.v3.72.1
How reproducible:
At least once in user's cluster while upgrading
Steps to Reproduce:
1. Have OCP 4.11 installed
2. Install ACS on top of the OCP cluster
3. Upgrade OCP to the next z-stream version
Actual results:
Upgrade gets blocked: waiting on csi-snapshot-controller
Expected results:
Upgrade should succeed
Additional info:
The stackrox SCCs (stackrox-admission-control, stackrox-collector and stackrox-sensor) set `readOnlyRootFilesystem` to `true`. Pods that do not explicitly define or request an SCC might receive one of these SCCs, which makes their deployment fail with a `read-only filesystem` message.
Issue 49 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
The form error is missing when importing a container image, whereas the Import from Git form shows the error correctly.
Screenshot: https://drive.google.com/file/d/1aUfUefnF3IxVzNjn7D3Q05pK9z4prVtN/view?usp=drive_link
Description of problem:
ovn-ipsec pods crash when the IPSec NS extension/svc is enabled on any $ROLE nodes. The IPSec extension and service were enabled for 2 workers only, and their corresponding ovn-ipsec pods are in CLBO (CrashLoopBackOff).
[root@dell-per740-36 ipsec]# oc get pods
NAME                                       READY   STATUS             RESTARTS         AGE
dell-per740-14rhtsengpek2redhatcom-debug   1/1     Running            0                3m37s
ovn-ipsec-bptr6                            0/1     CrashLoopBackOff   26 (3m58s ago)   130m
ovn-ipsec-bv88z                            1/1     Running            0                3h5m
ovn-ipsec-pre414-6pb25                     1/1     Running            0                3h5m
ovn-ipsec-pre414-b6vzh                     1/1     Running            0                3h5m
ovn-ipsec-pre414-jzwcm                     1/1     Running            0                3h5m
ovn-ipsec-pre414-vgwqx                     1/1     Running            3                132m
ovn-ipsec-pre414-xl4hb                     1/1     Running            3                130m
ovn-ipsec-qb2bj                            1/1     Running            0                3h5m
ovn-ipsec-r4dfw                            1/1     Running            0                3h5m
ovn-ipsec-xhdpw                            0/1     CrashLoopBackOff   28 (116s ago)    132m
ovnkube-control-plane-698c9845b8-4v58f     2/2     Running            0                3h5m
ovnkube-control-plane-698c9845b8-nlgs8     2/2     Running            0                3h5m
ovnkube-control-plane-698c9845b8-wfkd4     2/2     Running            0                3h5m
ovnkube-node-l6sr5                         8/8     Running            27 (66m ago)     130m
ovnkube-node-mj8bs                         8/8     Running            27 (75m ago)     132m
ovnkube-node-p24x8                         8/8     Running            0                178m
ovnkube-node-rlpbh                         8/8     Running            0                178m
ovnkube-node-wdxbg                         8/8     Running            0                178m
[root@dell-per740-36 ipsec]#
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-12-024050
How reproducible:
Always
Steps to Reproduce:
1. Install OVN IPSec cluster (East-West)
2. Enable IPSec OS extension for North-South
3. Enable IPSec service for North-South
Actual results:
ovn-ipsec pods in CLBO state
Expected results:
All pods in the ovn-kubernetes namespace should be Running.
Additional info:
Logs from one of the ovn-ipsec CLBO pods:
# oc logs ovn-ipsec-bptr6
Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init)
+ rpm --dbpath=/usr/share/rpm -q libreswan
libreswan-4.9-4.el9_2.x86_64
+ counter=0
+ '[' -f /etc/cni/net.d/10-ovn-kubernetes.conf ']'
+ echo 'ovnkube-node has configured node.'
ovnkube-node has configured node.
+ ip x s flush
+ ip x p flush
+ ulimit -n 1024
+ /usr/libexec/ipsec/addconn --config /etc/ipsec.conf --checkconfig
+ /usr/libexec/ipsec/_stackmanager start
+ /usr/sbin/ipsec --checknss
+ /usr/libexec/ipsec/pluto --leak-detective --config /etc/ipsec.conf --logfile /var/log/openvswitch/libreswan.log
FATAL ERROR: /usr/libexec/ipsec/pluto: lock file "/run/pluto/pluto.pid" already exists
leak: string logger, item size: 48
leak: string logger prefix, item size: 27
leak detective found 2 leaks, total size 75
journalctl -u ipsec here: https://privatebin.corp.redhat.com/?216142833d016b3c#2Es8ACSyM3VWvwi85vTaYtSx8X3952ahxCvSHeY61UtT
Description of problem:
When the network type is Calico for a hosted cluster, the RBAC policies that are laid down for CNO do not include permissions to deploy network-node-identity.
Version-Release number of selected component (if applicable):
How reproducible: IBM Satellite environment
The goal is to collect metrics about RHACS installations to capture billing and overall usage metrics for the product. We would also like to request a backport of the telemeter config to existing OpenShift cluster versions so that telemetry metrics become available sooner, as they provide critical information to our product management.
Central is the main backend component of RHACS ("hub"). The metric shows installation info about Central, as well as usage data via three gauges (secured clusters, secured nodes, secured vCPU). This is a recording rule where unnecessary labels like instance and job have already been removed.
Labels
Sensor is a component installed on clusters managed by RHACS. The metric shows installation info about Sensor, as well as usage data via two gauges (secured nodes, secured vCPU). The cardinality of the metric series is 1. This is a recording rule where unnecessary labels like instance and job have already been removed.
The cardinality of the metrics per cluster is 1.
Description of problem:
The issue was found in CI on an Azure private cluster: all the egressIP cases failed because the EgressIP cannot be applied to the egress node. It can also be reproduced manually.
Version-Release number of selected component (if applicable):
4.12.0-0.nightly-2023-01-08-142418
How reproducible:
Always
Steps to Reproduce:
1. Label one worker node as egress node
2. Create one egressIP object
Actual results:
% oc get egressip
NAME             EGRESSIPS    ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-2       10.0.1.10
egressip-47164   10.0.1.217
% oc get cloudprivateipconfig
NAME         AGE
10.0.1.10    18m
10.0.1.217   22m
% oc get cloudprivateipconfig -o yaml
apiVersion: v1
items:
- apiVersion: cloud.network.openshift.io/v1
  kind: CloudPrivateIPConfig
  metadata:
    annotations:
      k8s.ovn.org/egressip-owner-ref: egressip-2
    creationTimestamp: "2023-01-09T10:11:33Z"
    finalizers:
    - cloudprivateipconfig.cloud.network.openshift.io/finalizer
    generation: 1
    name: 10.0.1.10
    resourceVersion: "59723"
    uid: d697568a-7d7c-471a-b5e1-d7b814244549
  spec:
    node: huirwang-0109b-bv4ld-worker-eastus1-llmpb
  status:
    conditions:
    - lastTransitionTime: "2023-01-09T10:17:06Z"
      message: 'Error processing cloud assignment request, err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="OutboundRuleCannotBeUsedWithBackendAddressPoolThatIsReferencedBySecondaryIpConfigs" Message="OutboundRule /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/loadBalancers/huirwang-0109b-bv4ld/outboundRules/outbound-rule-v4 cannot be used with Backend Address Pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/loadBalancers/huirwang-0109b-bv4ld/backendAddressPools/huirwang-0109b-bv4ld that contains Secondary IPConfig /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/networkInterfaces/huirwang-0109b-bv4ld-worker-eastus1-llmpb-nic/ipConfigurations/huirwang-0109b-bv4ld-worker-eastus1-llmpb_10.0.1.10" Details=[]'
      observedGeneration: 1
      reason: CloudResponseError
      status: "False"
      type: Assigned
    node: huirwang-0109b-bv4ld-worker-eastus1-llmpb
- apiVersion: cloud.network.openshift.io/v1
  kind: CloudPrivateIPConfig
  metadata:
    annotations:
      k8s.ovn.org/egressip-owner-ref: egressip-47164
    creationTimestamp: "2023-01-09T10:07:56Z"
    finalizers:
    - cloudprivateipconfig.cloud.network.openshift.io/finalizer
    generation: 1
    name: 10.0.1.217
    resourceVersion: "58333"
    uid: 6a7d6196-cfc9-4859-9150-7371f5818b74
  spec:
    node: huirwang-0109b-bv4ld-worker-eastus1-llmpb
  status:
    conditions:
    - lastTransitionTime: "2023-01-09T10:13:29Z"
      message: 'Error processing cloud assignment request, err: network.InterfacesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: Code="OutboundRuleCannotBeUsedWithBackendAddressPoolThatIsReferencedBySecondaryIpConfigs" Message="OutboundRule /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/loadBalancers/huirwang-0109b-bv4ld/outboundRules/outbound-rule-v4 cannot be used with Backend Address Pool /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/loadBalancers/huirwang-0109b-bv4ld/backendAddressPools/huirwang-0109b-bv4ld that contains Secondary IPConfig /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/huirwang-0109b-bv4ld-rg/providers/Microsoft.Network/networkInterfaces/huirwang-0109b-bv4ld-worker-eastus1-llmpb-nic/ipConfigurations/huirwang-0109b-bv4ld-worker-eastus1-llmpb_10.0.1.217" Details=[]'
      observedGeneration: 1
      reason: CloudResponseError
      status: "False"
      type: Assigned
    node: huirwang-0109b-bv4ld-worker-eastus1-llmpb
kind: List
metadata:
  resourceVersion: ""
Expected results:
EgressIP can be applied correctly
Additional info:
Description of problem:
The HostedCluster name is not currently validated against RFC1123.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. 2. 3.
Actual results:
Any HostedCluster name is allowed
Expected results:
Only HostedCluster names meeting RFC1123 validation should be allowed.
Additional info:
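As an illustration of the requested validation, here is a minimal Python sketch of an RFC 1123 name check. It assumes the DNS label form that Kubernetes commonly applies to object names (lowercase alphanumerics and "-", starting and ending with an alphanumeric, at most 63 characters); the report does not say whether the label or the longer subdomain form is intended, and `is_rfc1123_label` is a hypothetical helper, not HyperShift code.

```python
import re

# Assumed rule: RFC 1123 DNS label as commonly enforced for Kubernetes names.
_DNS1123_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")
MAX_LABEL_LENGTH = 63


def is_rfc1123_label(name):
    """Return True if name is a valid RFC 1123 DNS label."""
    if len(name) > MAX_LABEL_LENGTH:
        return False
    # fullmatch requires the whole string to match, so an empty name fails.
    return _DNS1123_LABEL.fullmatch(name) is not None


# Hypothetical HostedCluster names used only to exercise the check.
for candidate in ("my-cluster", "My_Cluster", "-leading-dash", "a" * 64):
    print(f"{candidate[:20]!r}: {is_rfc1123_label(candidate)}")
```

Rejecting the name at creation time, rather than when a derived resource name later fails, gives the user an error that points at the actual cause.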
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/105
This is a clone of issue OCPBUGS-28787. The following is the description of the original issue:
—
Description of problem:
This was found when testing OCP-71263 and regression OCP-35770 for 4.15. For GCP in Mint mode, the root credential can be removed after cluster installation. But after the root credential was removed, CCO became degraded.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-25-051548 4.15.0-rc.3
How reproducible:
Always
Steps to Reproduce:
1. Install a GCP cluster with Mint mode
2. After install, remove the root credential
jianpingshu@jshu-mac ~ % oc delete secret -n kube-system gcp-credentials
secret "gcp-credentials" deleted
3. Wait some time (about half an hour to an hour); CCO becomes degraded
jianpingshu@jshu-mac ~ % oc get co cloud-credential
NAME               VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cloud-credential   4.15.0-rc.3   True        True          True       6h45m   6 of 7 credentials requests are failing to sync.
jianpingshu@jshu-mac ~ % oc -n openshift-cloud-credential-operator get -o json credentialsrequests | jq -r '.items[] | select(tostring | contains("InfrastructureMismatch") | not) | .metadata.name as $n | .status.conditions // [{type: "NoConditions"}] | .[] | .type + "=" + .status + " " + $n + " " + .reason + ": " + .message' | sort
CredentialsProvisionFailure=False openshift-cloud-network-config-controller-gcp CredentialsProvisionSuccess: successfully granted credentials request
CredentialsProvisionFailure=True cloud-credential-operator-gcp-ro-creds CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-ccm CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-gcp-pd-csi-driver-operator CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-image-registry-gcs CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-ingress-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
CredentialsProvisionFailure=True openshift-machine-api-gcp CredentialsProvisionFailure: failed to grant creds: unable to fetch root cloud cred secret: Secret "gcp-credentials" not found
openshift-cloud-network-config-controller-gcp has no failure because it doesn't have a customized role in 4.15.0-rc.3
Actual results:
CCO became degraded
Expected results:
CCO should not be degraded; only the "upgradeable" condition should be updated to report the missing root credential
Additional info:
Tested the same case on 4.14.10, no issue
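The jq filter used in the reproduction steps can be hard to read in one line. The Python sketch below performs the same summarization on the parsed output of `oc ... get -o json credentialsrequests`, as a reading aid rather than a replacement. `summarize_conditions` is a hypothetical helper, and `sample` is a hand-built, abbreviated input that reuses one name and message from the output above.

```python
import json


def summarize_conditions(credentials_requests):
    """Return sorted 'type=status name reason: message' lines, like the jq filter."""
    lines = []
    for item in credentials_requests.get("items", []):
        # jq: select(tostring | contains("InfrastructureMismatch") | not)
        if "InfrastructureMismatch" in json.dumps(item):
            continue
        name = item["metadata"]["name"]
        # jq: .status.conditions // [{type: "NoConditions"}]
        conditions = (item.get("status") or {}).get("conditions")
        if conditions is None:
            conditions = [{"type": "NoConditions"}]
        for cond in conditions:
            # Missing fields become empty strings, as null does in jq string addition.
            lines.append(
                f'{cond.get("type", "")}={cond.get("status", "")} {name} '
                f'{cond.get("reason", "")}: {cond.get("message", "")}'
            )
    return sorted(lines)


# Abbreviated, hand-built sample; real input comes from the oc command.
sample = {
    "items": [
        {
            "metadata": {"name": "openshift-gcp-ccm"},
            "status": {
                "conditions": [
                    {
                        "type": "CredentialsProvisionFailure",
                        "status": "True",
                        "reason": "CredentialsProvisionFailure",
                        "message": "failed to grant creds: unable to fetch root "
                        'cloud cred secret: Secret "gcp-credentials" not found',
                    }
                ]
            },
        },
        {"metadata": {"name": "no-status-yet"}},
    ]
}

for line in summarize_conditions(sample):
    print(line)
```

Each printed line names the condition, the CredentialsRequest, and the reason, so the failing requests can be spotted at a glance.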
Description of the problem:
In staging, BE 2.23.0: after adding API and Ingress VIPs manually and then changing the network to UMN, the BE responds with the error "User Managed Networking cannot be set with API VIP"
Discussed this with Nir Magnezi. We should add the ability for the BE to delete VIPs from the DB if the API gets such a request
This is a continuation of
How reproducible:
Steps to reproduce:
1. Add API and Ingress VIPs manually
2. Change the network to UMN
Actual results:
Expected results:
Description of problem:
configure-ovs.sh breaks the primary interface config by leaving generated configs in `/etc/NetworkManager/system-connections`
Version-Release number of selected component (if applicable):
4.10.52 -> 4.11.46 -> OCP 4.12.27 IPI VSphere
How reproducible:
Reboot any node; the node never becomes ready.
Steps to Reproduce:
1. Install and upgrade cluster
2. Reboot worker nodes after upgrade.
Actual results:
The primary interface never sends DHCP requests, and bad configs remain in /etc/NetworkManager/system-connections
Expected results:
No leftover ovs-configure configs, and the primary interface acquires an IP address using DHCP.
Additional info:
Workaround (only when using a single DHCP interface): rm /etc/NetworkManager/system-connections/*
Metal team has filed: OCPBUGS-24328
Seems to be permafailing for several days now. First payload https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-11-30-112918
Failure to bootstrap is quite hard to decipher for us.
Description of problem:
The console does not allow customizing the abbreviation that appears on the resource icon badge. This causes an issue for the FAR operator: for the CRD FenceAgentRemediationTemplate, the badge icon shows FART. The CRD includes a custom short name, but the console ignores it.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create the CRD (included link to github)
2. Navigate to Home -> Search
3. Enter far into the Resources filter
Actual results:
The badge FART shows in the dropdown
Expected results:
The badge should show fartemplate - the content of the short name
Additional info:
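The behavior described in this report can be sketched in a few lines of Python. This is not the console's code: `derived_abbreviation` assumes the badge is built from the capital letters of the kind, truncated to four characters (which matches the FART badge observed above), and `badge_text` is a hypothetical function showing the requested preference for a CRD short name.

```python
def derived_abbreviation(kind, max_len=4):
    """Abbreviate a kind from its capital letters (assumed console behavior)."""
    capitals = "".join(ch for ch in kind if ch.isupper())
    # Fall back to the upper-cased kind when it has no capital letters.
    return (capitals or kind.upper())[:max_len]


def badge_text(kind, short_names=None):
    """Prefer the first CRD short name; otherwise derive an abbreviation."""
    if short_names:
        return short_names[0]
    return derived_abbreviation(kind)


kind = "FenceAgentRemediationTemplate"
print(badge_text(kind))                    # actual result in the report
print(badge_text(kind, ["fartemplate"]))   # expected result in the report
```

Whether a long short name such as fartemplate fits inside the badge is a separate layout question that this sketch does not address.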
Description of problem:
Looking at the vSphere connection configuration in the UI, we can see that the value for vCenter cluster is populated with the "networks" value instead of the "computeCluster" one.
Additional info:
- https://github.com/openshift/console/blob/fdcd7738612cd5685c100b15d348134c96b2fa39[...]ackages/vsphere-plugin/src/components/VSphereConnectionForm.tsx
- https://github.com/openshift/console/blob/fdcd7738612cd5685c100b15d348134c96b2fa39/frontend/packages/vsphere-plugin/src/hooks/use-connection-form.ts#L69
From the form query it seems it is linked to the network:
======================================
vCenterCluster = domain?.topology?.networks?.[0] || '';
======================================
Our understanding is that it should pick up the cluster name:
======================================
topology.computeCluster
======================================
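To show the difference between the two lookups, here is a small Python sketch that mirrors the quoted expression on a plain dictionary. The function names are hypothetical, and the `domain` values (a cluster path and a network name) are made-up examples, not data from the report.

```python
def vcenter_cluster_current(domain):
    """Mirror of the quoted lookup: first entry of topology.networks, or ''."""
    topology = (domain or {}).get("topology") or {}
    networks = topology.get("networks") or []
    return networks[0] if networks else ""


def vcenter_cluster_expected(domain):
    """Lookup the report asks for: topology.computeCluster, or ''."""
    topology = (domain or {}).get("topology") or {}
    return topology.get("computeCluster") or ""


# Made-up failure domain for illustration.
domain = {
    "topology": {
        "computeCluster": "/example-dc/host/example-cluster",
        "networks": ["example-network"],
    }
}

print(vcenter_cluster_current(domain))
print(vcenter_cluster_expected(domain))
```

The first call returns the network name and the second returns the compute cluster path, which is the value the report expects in the vCenter cluster field.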
Description of problem:
In ROSA/OCP 4.14.z, attaching the AmazonEC2ContainerRegistryReadOnly policy to the worker nodes (in ROSA's case, it was attached to the ManagedOpenShift-Worker-Role, which the installer assigns to all worker nodes) has no effect on ECR image pulls: the user gets an authentication error. Attaching the policy should remove the need to provide an image-pull-secret, but the error is resolved only if the user also provides one. This is proven to work correctly in 4.12.z, so something seems to have changed in recent OCP versions.
Version-Release number of selected component (if applicable):
4.14.2 (ROSA)
How reproducible:
The issue is reproducible using the below steps.
Steps to Reproduce:
1. Create a deployment in ROSA or OCP on AWS, pointing at a private ECR repository
2. The image pull fails with "Error: ErrImagePull" and "authentication required" errors
Actual results:
The image pull fails with "Error: ErrImagePull" & "authentication required" errors. However, the image pull is successful only if the user provides an image-pull-secret to the deployment.
Expected results:
The image should be pulled successfully by virtue of the ECR-read-only policy attached to the worker node role; without needing an image-pull-secret.
Additional info:
In other words:
In OCP 4.13 (and below), if a user adds the ECR:* permissions to the worker instance profile, the user can specify ECR images, and the worker node authenticates to ECR using the instance profile. In 4.14 this no longer works.
Providing a pull secret in a deployment is not a sufficient alternative, because AWS rotates ECR tokens every 12 hours. That is not a viable solution for customers who, until OCP 4.13, did not have to rotate pull secrets constantly.
The experience in 4.14 should be the same as in 4.13 with ECR.
The current AWS policy that's used is this one: `arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly`
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:GetRepositoryPolicy",
        "ecr:DescribeRepositories",
        "ecr:ListImages",
        "ecr:DescribeImages",
        "ecr:BatchGetImage",
        "ecr:GetLifecyclePolicy",
        "ecr:GetLifecyclePolicyPreview",
        "ecr:ListTagsForResource",
        "ecr:DescribeImageScanFindings"
      ],
      "Resource": "*"
    }
  ]
}
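When triaging a report like this, it can help to rule out the policy itself. The following is a minimal sketch that checks whether a policy document grants the actions typically involved in an ECR pull. The `PULL_ACTIONS` set and the function names are assumptions of this sketch, not taken from the report, and wildcard actions such as `ecr:*` are deliberately not expanded.

```python
import json

# Actions commonly involved in pulling an image from ECR. This list is an
# assumption of the sketch, not something stated in the report.
PULL_ACTIONS = {
    "ecr:GetAuthorizationToken",
    "ecr:BatchCheckLayerAvailability",
    "ecr:GetDownloadUrlForLayer",
    "ecr:BatchGetImage",
}


def allowed_actions(policy: dict) -> set:
    """Collect the literal actions granted by Allow statements (no wildcards)."""
    actions = set()
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        listed = statement.get("Action", [])
        if isinstance(listed, str):
            listed = [listed]
        actions.update(listed)
    return actions


def missing_pull_actions(policy: dict) -> set:
    """Return the pull actions that the policy does not literally grant."""
    return PULL_ACTIONS - allowed_actions(policy)


# Abbreviated copy of the policy quoted in the report.
policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}
""")
print(missing_pull_actions(policy))
```

An empty result here suggests the policy is not the problem, which is consistent with the report's claim that the regression is in how 4.14 nodes authenticate rather than in the IAM permissions.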
Description of problem:
In the developer console, go to "Observe -> openshift-monitoring -> Alerts" and silence the Watchdog alert. At first the alert state is Silenced in the Alerts tab, but it quickly changes to Firing (the alert is actually silenced). See the attached screenshot.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-02-132842
How reproducible:
always
Steps to Reproduce:
1. Silence an alert in the dev console, and check the alert state in the Alerts tab.
Actual results:
alert state is changed from Silenced to Firing quickly
Expected results:
state should be Silenced
Please review the following PR: https://github.com/openshift/cluster-api-provider-gcp/pull/200
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: the per-node certificates should have a configurable duration
Description of problem:
The oc login --web command fails when used with a HyperShift guest cluster. The web console returns an error message stating that the client is unauthorized to request a token using this method.
Error Message:
{ "error": "unauthorized_client", "error_description": "The client is not authorized to request a token using this method." }
OCP does not have this issue.
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-11-21-212406 4.14 4.15
How reproducible:
always
Steps to Reproduce:
1. Install a HyperShift guest cluster.
2. Configure an OpenID identity provider for the HyperShift guest cluster, e.g. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-62511
3. Execute the oc login --web $URL command.
4. After adding openshift-cli-client manually, it works:
# cat oauth.yaml
apiVersion: oauth.openshift.io/v1
grantMethod: auto
kind: OAuthClient
metadata:
  name: openshift-cli-client
redirectURIs:
- http://127.0.0.1/callback,http://[::1]/callback
respondWithChallenges: false
# oc create -f oauth.yaml
oauthclient.oauth.openshift.io/openshift-cli-client created
$ oc login --web $URL
Opening login URL in the default browser: https://oauth-clusters-hypershift-ci-28276.apps.xxxxxxxxxxxxxxxx.com:443/oauth/authorize?client_id=openshift-cli-client&code_challenge=mixnB73nR_yzL58e0lEd4soQH1sn0GjvWEfnX4PNrCg&code_challenge_method=S256&redirect_uri=http%3A%2F%2F127.0.0.1%3A45055%2Fcallback&response_type=code
Login successful.
Actual results:
Step 3: The web login process fails and redirects to an error page displaying the error message "error_description": "The client is not authorized to request a token using this method."
Expected results:
The OAuthClient 'openshift-cli-client' should exist on HyperShift guest clusters so that the oc login --web $URL command works without any issues, as OCP 4.13+ has the OAuthClient 'openshift-cli-client' by default.
Additional info:
The issue can be tracked at the following URL: https://issues.redhat.com/browse/AUTH-444
Root Cause :
The default 'openshift-cli-client' OAuthClient is missing for HyperShift guest clusters.
Dockerfile.okd is behind compared to Dockerfile
Description of problem:
The ServiceAccounts for both the in-cluster and UWM Alertmanager set automountServiceAccountToken: true.
This should be improved and set at the pod level. Hence this will require a change in prometheus-operator and its configuration of Alertmanager pods.
A similar change for Prometheus pods was implemented in https://github.com/prometheus-operator/prometheus-operator/pull/4514.
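For context, Kubernetes lets both the ServiceAccount and the pod spec set `automountServiceAccountToken`, and the pod-level value takes precedence when both are set. A minimal sketch of that resolution (the function name is hypothetical; this is not prometheus-operator code):

```python
from typing import Optional


def effective_automount(pod_value: Optional[bool], sa_value: Optional[bool]) -> bool:
    """Resolve whether the service account token is mounted into a pod.

    The pod spec wins over the ServiceAccount when both set
    automountServiceAccountToken; if neither sets it, the token is mounted.
    """
    if pod_value is not None:
        return pod_value
    if sa_value is not None:
        return sa_value
    return True


# ServiceAccount opts out, pod explicitly opts in: the pod-level value wins.
print(effective_automount(pod_value=True, sa_value=False))
```

Setting the value at the pod level therefore allows the ServiceAccount default to be disabled without affecting pods that explicitly need the token.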
General code cleanup and improvement
Issue 58 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Quickstarts catalog item count isn't vertically aligned anymore
Screenshot: https://drive.google.com/file/d/1hxh5VI2S7jLKRdNlDQsdlAXL_G7TxtME/view?usp=sharing
Please review the following PR: https://github.com/openshift/cluster-openshift-controller-manager-operator/pull/304
Please review the following PR: https://github.com/openshift/cluster-ingress-operator/pull/977
E1106 21:44:31.805740 18 apiaccess_count_controller.go:168] APIRequestCount.apiserver.openshift.io "nodes.v1" is invalid: [status.currentHour.byNode[0].byUser: Too many: 708: must have at most 500 items, status.last24h[21].byNode[0].byUser: Too many: 708: must have at most 500 items]
Seen in a large-scale test: 750 nodes, 180,000 pods, 90,000 services, with pods/services being created at 20 objects/second.
https://redhat-internal.slack.com/archives/CB48XQ4KZ/p1699307146216599
Luis Sanchez said "Just confirmed that under certain circumstances, the .spec.numberOfUsersToReport field is not being applied correctly. Open a bug please."
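The validation error above caps `byUser` at 500 items, while the controller tried to write 708. A minimal sketch of the kind of truncation that would keep the status valid is shown below. The function name and the sample data are hypothetical, and this is not the controller's Go code; only the 500-item limit and the `numberOfUsersToReport` field name come from the report.

```python
from collections import Counter
from typing import Dict, List, Tuple

MAX_BY_USER_ITEMS = 500  # limit quoted in the validation error


def top_users(request_counts: Dict[str, int], number_to_report: int) -> List[Tuple[str, int]]:
    """Keep the busiest users, never exceeding the schema limit.

    number_to_report stands in for .spec.numberOfUsersToReport.
    """
    limit = max(0, min(number_to_report, MAX_BY_USER_ITEMS))
    return Counter(request_counts).most_common(limit)


# Hypothetical per-user request counts for 708 distinct users.
counts = {f"user-{i}": i for i in range(708)}
print(len(top_users(counts, 1000)))
```

Clamping to the schema limit in addition to the configured value means a misapplied or oversized setting cannot produce an object the API server rejects.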
For many 4.y releases (since before 4.11, and in all the minor versions that are still supported), CRI-O has wiped images when it comes up after a node reboot and notices it has a new (minor?) version. This causes redundant pulls, as seen in this 4.11-to-4.12 update run:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-upgrade-from-stable-4.11-e2e-azure-sdn-upgrade/1732741139229839360/artifacts/e2e-azure-sdn-upgrade/gather-extra/artifacts/nodes/ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4/journal | zgrep 'Starting update from rendered-\|crio-wipe\|Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2'
Dec 07 13:05:42.474144 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
Dec 07 13:05:42.481470 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 191ms CPU time
Dec 07 13:59:51.000686 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1498]: time="2023-12-07 13:59:51.000591203Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=a62bc972-67d7-401a-9640-884430bd16f1 name=/runtime.v1.ImageService/PullImage
Dec 07 14:00:55.745095 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 root[101294]: machine-config-daemon[99469]: Starting update from rendered-worker-ca36a33a83d49b43ed000fd422e09838 to rendered-worker-c0b3b4eadfe6cdfb595b97fa293a9204: &{osUpdate:true kargs:false fips:false passwd:false files:true units:true kernelType:false extensions:false}
Dec 07 14:05:33.274241 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Succeeded.
Dec 07 14:05:33.289605 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 systemd[1]: crio-wipe.service: Consumed 216ms CPU time
Dec 07 14:14:50.277011 ci-op-rzrplpjd-7f65d-vwrzs-worker-eastus21-lcgk4 crio[1573]: time="2023-12-07 14:14:50.276961087Z" level=info msg="Pulled image: registry.ci.openshift.org/ocp/4.12-2023-12-07-060628@sha256:3c3e67faf4b6e9e95bebb0462bd61c964170893cb991b5c4de47340a2f295dc2" id=1a092fbd-7ffa-475a-b0b7-0ab115dbe173 name=/runtime.v1.ImageService/PullImage
The redundant pulls cost network and disk traffic, and avoiding them should make those update-initiated reboots quicker and cheaper. The lack of update-initiated wipes is not expected to cost much, because the Kubelet's old-image garbage collection should be along to clear out any no-longer-used images if disk space gets tight.
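The pattern in the journal excerpt (an image pulled, a crio-wipe, then the same image pulled again) can be detected mechanically. The following is a minimal sketch that assumes journal lines shaped like the ones quoted above; the function name and the shortened sample lines are hypothetical.

```python
import re
from typing import Iterable, Set

PULLED = re.compile(r'Pulled image: ([^\s"]+)')
WIPE_DONE = "crio-wipe.service: Succeeded."


def repulled_after_wipe(lines: Iterable[str]) -> Set[str]:
    """Return images pulled before a crio-wipe and pulled again after it."""
    seen_before_wipe: Set[str] = set()
    since_last_wipe: Set[str] = set()
    repulled: Set[str] = set()
    for line in lines:
        if WIPE_DONE in line:
            # Everything pulled so far is now considered wiped.
            seen_before_wipe |= since_last_wipe
            since_last_wipe = set()
            continue
        match = PULLED.search(line)
        if match:
            image = match.group(1)
            if image in seen_before_wipe:
                repulled.add(image)
            since_last_wipe.add(image)
    return repulled


# Shortened, hypothetical journal lines in the same shape as the excerpt.
journal = [
    "systemd[1]: crio-wipe.service: Succeeded.",
    'crio[1498]: level=info msg="Pulled image: registry.example/ocp@sha256:abc" id=1',
    "systemd[1]: crio-wipe.service: Succeeded.",
    'crio[1573]: level=info msg="Pulled image: registry.example/ocp@sha256:abc" id=2',
]
print(repulled_after_wipe(journal))
```

Any image reported by this check was downloaded twice across the update reboot, which is the redundant traffic the report wants to avoid.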
Version-Release number of selected component (if applicable):
At least 4.11. Possibly older 4.y; I haven't checked.
How reproducible:
Every time.
Steps to Reproduce:
1. Install a cluster.
2. Update to a release image with a different CRI-O (minor?) version.
3. Check logs on the nodes.
Actual results:
crio-wipe entries in the logs, with reports of target-release images being pulled before and after those wipes, as I quoted in the Description.
Expected results:
Target-release images pulled before the reboot, and found in the local cache if that image is needed again post-reboot.
Description of problem:
When upgrading cluster from 4.13.23 to 4.14.3, machine-config CO gets stuck due to a content mismatch error on all nodes. Node node-xxx-xxx is reporting: "unexpected on-disk state validating against rendered-master-734521b50f69a1602a3a657419ed4971: content mismatch for file \"/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt\""
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Perform an upgrade from 4.13.x to 4.14.x.
Actual results:
machine-config stalls during upgrade
Expected results:
the "content mismatch" shouldn't happen anymore according to the MCO engineering team
Additional info:
We heavily rely on scripts located in
https://github.com/dougsland/bz-query
in order to assign Jiras to members of the SDN team.
As the person in charge of knowing the bug load on each of our developers, in order to decide
who is the best person to own unassigned Jiras, I want the scripts to live in a more formal location.
Description of problem:
An assisted-service fix, https://issues.redhat.com//browse/MGMT-15340, resolved an issue in the nmstateconfig scripts to ensure VLAN names are < 15 characters. The same fix needs to be merged into the agent installer.
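As background, Linux interface names are stored in a buffer of `IFNAMSIZ` (16) bytes including the terminating NUL, so a name longer than 15 bytes cannot be created. A minimal pre-flight check is sketched below. The function name and sample names are hypothetical, this is not the assisted-service or agent installer code, and the exact threshold used by the referenced fix may differ from the kernel limit assumed here.

```python
IFNAMSIZ = 16  # Linux interface name buffer size, including the trailing NUL
MAX_IFNAME_LEN = IFNAMSIZ - 1


def overlong_interface_names(names):
    """Return the interface names that exceed the kernel's length limit."""
    return [name for name in names if len(name.encode("utf-8")) > MAX_IFNAME_LEN]


# Hypothetical VLAN interface names in the <parent>.<vlan-id> style.
names = ["eth0.100", "enp94s0f0np0.2001"]
print(overlong_interface_names(names))
```

Running a check like this on the static network configuration before building the agent image would surface the problem earlier than a failed installation.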
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create an agent image with static networking using a VLAN with a long name (greater than 15 characters).
2. Boot a host with the agent image.
Actual results:
The installation will fail
Expected results:
The installation will pass.
Additional info:
Description of problem:
0000_90_kube-apiserver-operator_04_servicemonitor-apiserver lists the PrometheusRule `kube-apiserver`, which is meant to be deleted by CVO (it has the `release.openshift.io/delete: "true"` annotation). This manifest is no longer needed, as the `cluster:apiserver_current_inflight_requests:sum:max_over_time:2m` recording rule is already provided by other PrometheusRules. If this is meant to be removed in 4.13, it's safe to remove the manifest in 4.14: we don't allow skipping 4.13, so by the time users start a 4.14 update, CVO will already have removed this manifest from their clusters.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
A customer created hosted control plane (HCP, of type KubeVirt) clusters on a hub OCP cluster. So that their workloads could pull images on the HCP cluster, they added auth for our registries to a secret named "scale-rm-pull-secret" in the "clusters" namespace on the hub cluster, and then specified this secret in the HostedCluster CR for the HCP in question, in the "clusters" namespace on the hub. They expect this change to be reflected on the HCP cluster nodes and images to be pulled successfully. However, they keep getting ImagePullBackOff errors on the HCP cluster:
Pod: ibm-spectrum-scale-controller-manager-5cb84655b4-dvnxk
Namespace: ibm-spectrum-scale-operator
Generated from kubelet on scale-41312-t7nml 2 times in the last 0 minutes
Failed to pull image "icr.io/cpopen/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8": rpc error: code = Unknown desc = (Mirrors also failed: [cp.stg.icr.io/cp/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: Requesting bearer token: invalid status code from registry 400 (Bad Request)] [docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-cloud-native-docker-local/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: unable to retrieve auth token: invalid username/password: authentication required]): icr.io/cpopen/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: reading manifest sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8 in icr.io/cpopen/ibm-spectrum-scale-operator: manifest unknown
The customer is able to pull the image manually using the same credentials:
podman pull docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-cloud-native-docker-local/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The image was pulled manually on the nodes successfully after logging in to the registry with the same credentials, but the pod continues to report that it cannot pull the image. Another thing to note is that the pod has imagePullPolicy "IfNotPresent", so it is unclear why it continues to throw the same error after a manual pull on all three nodes.
podman pull docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-cloud-native-docker-local/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8
Trying to pull docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-cloud-native-docker-local/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8...
Getting image source signatures
Copying blob 1e3d9b7d1452 skipped: already exists
Copying blob fe5ca62666f0 skipped: already exists
Copying blob e8c73c638ae9 skipped: already exists
Copying blob fcb6f6d2c998 skipped: already exists
Copying blob b02a7525f878 skipped: already exists
Copying blob 4aa0ea1413d3 skipped: already exists
Copying blob 7c881f9ab25e skipped: already exists
Copying blob 5627a970d25e skipped: already exists
Copying blob c7e34367abae skipped: already exists
Copying blob f92848770344 skipped: already exists
Copying blob a7ca0d9ba68f skipped: already exists
Copying config 07120ff2fe done
Writing manifest to image destination
Storing signatures
07120ff2fe00d6335ef757b33546fc9ec9e3d799a500349343f09228bcdf73c0
sh-5.1#
Pod: ibm-spectrum-scale-controller-manager-5cb84655b4-dvnxk
Namespace: ibm-spectrum-scale-operator
21 Sept 2023, 17:58
Generated from kubelet on scale-41312-t7nml 2 times in the last 0 minutes
Failed to pull image "icr.io/cpopen/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8": rpc error: code = Unknown desc = (Mirrors also failed: [cp.stg.icr.io/cp/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: Requesting bearer token: invalid status code from registry 400 (Bad Request)] [docker-na-public.artifactory.swg-devops.com/sys-spectrum-scale-team-cloud-native-docker-local/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: unable to retrieve auth token: invalid username/password: authentication required]): icr.io/cpopen/ibm-spectrum-scale-operator@sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8: reading manifest sha256:f6138abb5493d7ef6405dcf0a6bb5afc697cca9f20be1a88b3214268b6382da8 in icr.io/cpopen/ibm-spectrum-scale-operator: manifest unknown
Description of problem:
Azure Stack Hub doesn't support Azure-file yet (from https://learn.microsoft.com/en-us/azure-stack/user/azure-stack-acs-differences?view=azs-2206), so we should not install Azure-file-CSI-Driver on it.
$ oc get infrastructures cluster -o json | jq .status.platformStatus.azure
{
  "armEndpoint": "https://management.mtcazs.wwtatc.com",
  "cloudName": "AzureStackCloud",
  "networkResourceGroupName": "wduan-0516b-ash-rs7gh-rg",
  "resourceGroupName": "wduan-0516b-ash-rs7gh-rg"
}
$ oc get clustercsidrivers file.csi.azure.com
NAME                 AGE
file.csi.azure.com   45m
$ oc get sc azurefile-csi
NAME            PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   ALLOWVOLUMEEXPANSION   AGE
azurefile-csi   file.csi.azure.com   Delete          Immediate           true                   47m
$ oc describe pvc mydep-pvc-02
Warning ProvisioningFailed <invalid> file.csi.azure.com_wduan-0516b-ash-rs7gh-master-1_19c3f203-70a7-4d7f-afcc-22665adff5fe failed to provision volume with StorageClass "azurefile-csi": rpc error: code = Internal desc = failed to ensure storage account: failed to create storage account f0f49c11984fb413a958286, error: &{false 400 0001-01-01 00:00:00 +0000 UTC { "code": "StorageAccountInvalidKind", "message": "The requested storage account kind is invalid in this location.", "target": "StorageAccount" }}
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-05-11-225357
How reproducible:
Always
Steps to Reproduce:
See Description
Actual results:
Azure-file-CSI-Driver is installed on Azure Stack Hub
Expected results:
Azure-file-CSI-Driver should not be installed on Azure Stack Hub
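The expected behavior amounts to a platform check on the `cloudName` field shown in the infrastructure status above. A minimal sketch of such a guard follows; the function name is hypothetical and this is not the operator's actual Go code.

```python
def should_install_azure_file_csi(azure_platform_status: dict) -> bool:
    """Skip the Azure File CSI driver on Azure Stack Hub.

    Takes the object found at .status.platformStatus.azure of the
    cluster Infrastructure resource.
    """
    return azure_platform_status.get("cloudName") != "AzureStackCloud"


# Platform status as quoted in the report.
ash_status = {
    "armEndpoint": "https://management.mtcazs.wwtatc.com",
    "cloudName": "AzureStackCloud",
    "networkResourceGroupName": "wduan-0516b-ash-rs7gh-rg",
    "resourceGroupName": "wduan-0516b-ash-rs7gh-rg",
}
print(should_install_azure_file_csi(ash_status))
```

With a guard like this, neither the ClusterCSIDriver nor the azurefile-csi StorageClass would be created on Azure Stack Hub, so PVCs would not fail with StorageAccountInvalidKind.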
Please review the following PR: https://github.com/openshift/cluster-network-operator/pull/2006
This is a clone of issue OCPBUGS-28663. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This test triggers failures shortly after node reboot. Of course the node isn't ready, it rebooted.
: [sig-node] nodes should not go unready after being upgraded and go unready only once
{ 1 nodes violated upgrade expectations:
Node ci-op-q38yw8yd-8aaeb-lsqxj-master-0 went unready multiple times: 2023-10-11T21:58:45Z, 2023-10-11T22:05:45Z
Node ci-op-q38yw8yd-8aaeb-lsqxj-master-0 went ready multiple times: 2023-10-11T21:58:46Z, 2023-10-11T22:07:18Z }
Both of those times, master-0 was rebooted or being rebooted.
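One way to make a check like this tolerate reboots is to excuse unready transitions that fall inside a known reboot window. The sketch below illustrates the idea only: the function name, the grace period, and the reboot windows are all hypothetical, and only the two unready timestamps come from the test output above.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]


def unexpected_unready(
    unready_times: List[datetime],
    reboot_windows: List[Window],
    grace: timedelta,
) -> List[datetime]:
    """Return unready transitions not explained by a reboot window."""

    def excused(moment: datetime) -> bool:
        return any(start <= moment <= end + grace for start, end in reboot_windows)

    return [moment for moment in unready_times if not excused(moment)]


# Unready timestamps from the test output.
unready = [
    datetime(2023, 10, 11, 21, 58, 45),
    datetime(2023, 10, 11, 22, 5, 45),
]
# Hypothetical reboot windows; the report does not give exact reboot times.
reboots = [
    (datetime(2023, 10, 11, 21, 58, 0), datetime(2023, 10, 11, 21, 59, 0)),
    (datetime(2023, 10, 11, 22, 5, 0), datetime(2023, 10, 11, 22, 7, 0)),
]
print(unexpected_unready(unready, reboots, timedelta(minutes=1)))
```

Under these assumed windows both transitions are excused, so the test would only fail on unready events that no reboot accounts for.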
Description of problem:
This issue was found when verifying OCPBUGS-21637: the prometheus-operator-admission-webhook logs are too verbose.
$ oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator-admission-webhook
NAME                                                     READY   STATUS    RESTARTS   AGE
prometheus-operator-admission-webhook-5d96cbcbfc-6lx4m   1/1     Running   0          56m
prometheus-operator-admission-webhook-5d96cbcbfc-jj66x   1/1     Running   0          53m
$ oc -n openshift-monitoring logs prometheus-operator-admission-webhook-5d96cbcbfc-6lx4m
level=info ts=2023-11-06T01:50:33.617049649Z caller=main.go:140 address=[::]:8443 msg="Starting TLS enabled server" http2=false
ts=2023-11-06T01:50:34.601774794Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:40.439015896Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:40.43925044Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:50.437745065Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:50:50.448362455Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:51:00.428162615Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:51:00.428571968Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:51:10.426317894Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:51:10.426769416Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)"
ts=2023-11-06T01:51:20.426701853Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:20.427289877Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:30.429156675Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:30.429229042Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:40.426522527Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:40.427038656Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:50.428974832Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:51:50.429036156Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:00.428747039Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:00.42880275Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:10.426871896Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:10.428574666Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" 
ts=2023-11-06T01:52:20.428211529Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:20.428638108Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:30.427148775Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:30.427631515Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:40.427167231Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:40.427658789Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:50.427851476Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:52:50.428319729Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:00.428583783Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:00.429083642Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:10.426258718Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:10.426788637Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" 
ts=2023-11-06T01:53:20.430876533Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:20.431510269Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:30.427527316Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:30.428046481Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:40.428449342Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:40.428886681Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:50.426513473Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:53:50.427038956Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:00.426639171Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:00.427164997Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:10.426804033Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:10.427276217Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" 
ts=2023-11-06T01:54:20.427705297Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:20.428214309Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:30.428041006Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:30.428525809Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:40.426257489Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:40.42674803Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:50.42708913Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:54:50.427155482Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:00.428431788Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:00.428881681Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:10.429549989Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:10.429618004Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:20.427741192Z 
caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:20.428196221Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:30.4269946Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:30.427451901Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:40.426994787Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:40.427502475Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:50.426456346Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:55:50.426610051Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:00.426520596Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:00.426676076Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:10.435077603Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:10.435135319Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:20.427693249Z caller=stdlib.go:105 
caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:20.428171589Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:30.428760772Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:30.428828762Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:40.428545666Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:40.429005303Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:50.426103842Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:56:50.426578009Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:00.427041793Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:00.427482797Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:10.427963834Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:10.428440451Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:20.428877932Z caller=stdlib.go:105 caller=server.go:3215 
msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:20.428945521Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:30.426157935Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:30.426639545Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:40.42875961Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:40.42884264Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:50.426450177Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:57:50.426939532Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:00.428456873Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:00.428904131Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:10.428931448Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:10.428987646Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:20.429377819Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous 
response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:20.4294396Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:30.428108184Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:30.428580595Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:40.426962512Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:40.427429076Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:50.429177401Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:58:50.429637834Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:00.428197981Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:00.428655487Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:10.426418388Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:10.426908577Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:20.426705875Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from 
main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:20.427197531Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:30.427909675Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:30.428395421Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:40.429100447Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:40.429871853Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:50.4268663Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T01:59:50.427329161Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:00.429149297Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:00.429205811Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:10.426857098Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:10.427290243Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:20.42638474Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" 
ts=2023-11-06T02:00:20.426901703Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:30.428885162Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:30.429373666Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:40.427093878Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:40.427622056Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:50.428691098Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:00:50.428743261Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:00.426355861Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:00.42685464Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:10.426208743Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:10.426710363Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:20.426872491Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:20.42731801Z 
caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:30.426612427Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:30.427084214Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:40.428796629Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:40.429400491Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:50.427001992Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:01:50.42827597Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:00.428013056Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:00.428469744Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:10.426711057Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:10.427247058Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:20.429136255Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:20.429208369Z caller=stdlib.go:105 
caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:30.427158806Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:30.427593326Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:40.426389918Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:40.426875768Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:50.429551365Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:02:50.429619241Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:00.426621326Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:00.427126079Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:10.426301507Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:10.426803336Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:13.952615577Z caller=stdlib.go:105 caller=server.go:3215 msg="http: TLS handshake error from 10.130.0.1:52552: EOF" ts=2023-11-06T02:03:20.426371089Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous 
response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:20.426852234Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:30.428789504Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:30.428874536Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:40.427028458Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:40.427463333Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:50.429615112Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:03:50.429679407Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:00.4285878Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:00.429074488Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:10.4279579Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:10.428403727Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:20.426433063Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from 
main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:20.426940057Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:30.428317498Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:30.428730147Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:40.42911069Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:40.429194383Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:50.42820753Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:04:50.428643464Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:05:00.427890872Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" ts=2023-11-06T02:05:00.428356508Z caller=stdlib.go:105 caller=server.go:3215 msg="http: superfluous response.WriteHeader call from ...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-04-120954
How reproducible:
always
Steps to Reproduce:
1. Check the prometheus-operator-admission-webhook logs
Actual results:
The prometheus-operator-admission-webhook logs are verbose.
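One way to quantify how noisy such a log is would be to count how often each message repeats. The following is a minimal sketch, assuming the logfmt-style entries shown in this report and that no msg value contains an escaped quote; the function name count_messages is hypothetical.

```python
import re
from collections import Counter

# Matches the msg="..." field of a logfmt-style entry.
# Assumption: the message itself contains no escaped double quotes.
MSG_RE = re.compile(r'msg="([^"]*)"')


def count_messages(log_text):
    """Count how often each msg value appears in the given log text."""
    return Counter(MSG_RE.findall(log_text))


# Three entries copied from the log in this report.
sample = (
    'ts=2023-11-06T01:54:20.427705297Z caller=stdlib.go:105 caller=server.go:3215 '
    'msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" '
    'ts=2023-11-06T01:54:20.428214309Z caller=stdlib.go:105 caller=server.go:3215 '
    'msg="http: superfluous response.WriteHeader call from main.newSrv.func1 (main.go:173)" '
    'ts=2023-11-06T02:03:13.952615577Z caller=stdlib.go:105 caller=server.go:3215 '
    'msg="http: TLS handshake error from 10.130.0.1:52552: EOF"'
)

counts = count_messages(sample)
for message, n in counts.most_common():
    print(n, message)
```

Because the pattern does not depend on line breaks, it also works on log text whose entries have been wrapped or joined.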
The origin test suite does not test CPMS, so a CPMS rollout should never occur during a run.
We should add a test that checks, early in the suite, that the control plane machines are all named <cluster-name>master<index>. If for any reason we see a control plane machine matching <cluster-name>master<random>-<index>, we know that the CPMS has rolled out, and the test should be aborted until we work out why the CPMS rolled out.
The hope here is that it becomes very obvious when there are issues with CPMS, even when these issues are introduced by other repositories.
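The naming check described above can be sketched as follows. This is an illustration only, not the origin test itself: it assumes the name components are joined with hyphens (for example mycluster-master-0 before a rollout and mycluster-master-x7k2p-0 after one) and that the random part is lowercase alphanumeric. The function name and the sample machine names are hypothetical.

```python
import re


def classify_control_plane_machine(name, cluster_name):
    """Classify a control plane machine name.

    Returns "original" for <cluster-name>-master-<index>,
    "cpms-rollout" for <cluster-name>-master-<random>-<index>,
    and "unknown" for anything else.
    Assumption: components are hyphen-separated and <random> is [a-z0-9]+.
    """
    prefix = re.escape(cluster_name)
    if re.fullmatch(prefix + r"-master-\d+", name):
        return "original"
    if re.fullmatch(prefix + r"-master-[a-z0-9]+-\d+", name):
        return "cpms-rollout"
    return "unknown"


def cpms_has_rolled_out(machine_names, cluster_name):
    """Return True if any machine name indicates a CPMS rollout."""
    return any(
        classify_control_plane_machine(n, cluster_name) == "cpms-rollout"
        for n in machine_names
    )


before = ["mycluster-master-0", "mycluster-master-1", "mycluster-master-2"]
after = ["mycluster-master-0", "mycluster-master-x7k2p-1", "mycluster-master-2"]
print(cpms_has_rolled_out(before, "mycluster"))
print(cpms_has_rolled_out(after, "mycluster"))
```

A test built on this idea would abort as soon as cpms_has_rolled_out returns True, so that the rollout is investigated before other results are trusted.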
Description of problem:
After installing ACM/MCE, there is a dropdown list for switching clusters in the top masthead. The items in the dropdown list are not marked for i18n, so there are no translations for different languages.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-25-000711
How reproducible:
Always
Steps to Reproduce:
1. From OperatorHub, install the MCE operator and install the required operand with the defaults.
2. After refreshing the browser, check the translation of the clusters dropdown list: "All Clusters/local-cluster".
3.
Actual results:
2. The items are not marked for i18n and have no translations for different languages.
Expected results:
3. They should have translations for different languages.
Additional info:
Description of problem:
[Multi-NIC] EgressIP was not correctly reassigned when labeling/unlabeling an egress node
Version-Release number of selected component (if applicable):
Tested PRs openshift/cluster-network-operator#1969 and openshift/ovn-kubernetes#1832 together
How reproducible:
Steps to Reproduce:
1. Label the worker-0 node as an egress node, and create one egressip object
# oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.22.0.100   worker-0        172.22.0.100
2. Create another egressIP object; this egressIP is located on worker-0 as well.
# oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.22.0.100   worker-0        172.22.0.100
egressip-2   172.22.0.101   worker-0        172.22.0.101
3. Check the secondary NIC on the egress node; the two IPs were correctly added
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:da:86:9b:3e:ac brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.86/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 96sec preferred_lft 96sec
    inet 172.22.0.100/32 scope global enp1s0ovn
       valid_lft forever preferred_lft forever
    inet 172.22.0.101/32 scope global enp1s0ovn
       valid_lft forever preferred_lft forever
    inet6 fe80::2da:86ff:fe9b:3eac/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
4. Label another node, worker-1, as an egress node
5. Delete egressip-2 and recreate it; egressip-2 is now on worker-1
# oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.22.0.100   worker-0        172.22.0.100
egressip-2   172.22.0.101   worker-1        172.22.0.101
6. Remove the egress label from worker-1; 172.22.0.101 was reassigned to worker-0
# oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE   ASSIGNED EGRESSIPS
egressip-1   172.22.0.100   worker-0        172.22.0.100
egressip-2   172.22.0.101   worker-0        172.22.0.101
7. Check the secondary NICs of worker-0 and worker-1
Actual results:
The EgressIP was not removed from worker-1
# oc debug node/worker-1
Starting pod/worker-1-debug-pw7xk ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.24
If you don't see a command prompt, try pressing enter.
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:da:86:9b:3e:b0 brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.90/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 115sec preferred_lft 115sec
    inet 172.22.0.101/32 scope global enp1s0ovn
       valid_lft forever preferred_lft forever
    inet6 fe80::2da:86ff:fe9b:3eb0/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
172.22.0.100 was missing from worker-0
# oc debug node/worker-0
Starting pod/worker-0-debug-8nz5f ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.23
If you don't see a command prompt, try pressing enter.
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:da:86:9b:3e:ac brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.86/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0
       valid_lft 68sec preferred_lft 68sec
    inet 172.22.0.101/32 scope global enp1s0ovn
       valid_lft forever preferred_lft forever
    inet6 fe80::2da:86ff:fe9b:3eac/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
Expected results:
The egressIP should be correctly reassigned to the correct egress node
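The mismatch between the assignment reported by oc get egressip and the addresses actually configured on each NIC can be checked mechanically. The sketch below is an assumption-laden illustration: it treats every /32 IPv4 address in the ip a show output as an egress IP, and the helper names are hypothetical. The sample text is abridged from the worker-0 and worker-1 output in this report.

```python
import re


def egress_ips_on_interface(ip_output):
    """Return the set of /32 IPv4 addresses found in `ip a show` output.

    Assumption: egress IPs are the only /32 addresses on the interface.
    """
    return set(re.findall(r"inet (\d+\.\d+\.\d+\.\d+)/32\b", ip_output))


def diff_assignment(expected, observed):
    """Compare expected egress IPs with those observed on the NIC."""
    return {
        "missing": sorted(expected - observed),
        "stale": sorted(observed - expected),
    }


worker0_output = (
    "inet 172.22.0.86/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0 "
    "inet 172.22.0.101/32 scope global enp1s0ovn "
    "inet6 fe80::2da:86ff:fe9b:3eac/64 scope link noprefixroute"
)
worker1_output = (
    "inet 172.22.0.90/24 brd 172.22.0.255 scope global dynamic noprefixroute enp1s0 "
    "inet 172.22.0.101/32 scope global enp1s0ovn "
    "inet6 fe80::2da:86ff:fe9b:3eb0/64 scope link noprefixroute"
)

# After step 6, both egress IPs are assigned to worker-0 and none to worker-1.
worker0_diff = diff_assignment(
    {"172.22.0.100", "172.22.0.101"}, egress_ips_on_interface(worker0_output)
)
worker1_diff = diff_assignment(set(), egress_ips_on_interface(worker1_output))
print("worker-0:", worker0_diff)
print("worker-1:", worker1_diff)
```

With the data from this report, worker-0 is missing 172.22.0.100 and worker-1 still holds a stale 172.22.0.101, which matches the actual results described above.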
Additional info:
This is a clone of issue OCPBUGS-26236. The following is the description of the original issue:
—
Description of problem:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Version-Release number of selected component (if applicable):
4.16.0-0.ci-2024-01-05-050911
How reproducible:
Steps to Reproduce:
1. Create a PVC, e.g. "my-pvc"
2. Create a Pod and bind it to "my-pvc"
3. Create a VolumeSnapshot and associate it with "my-pvc"
4. Go to PVC detail > VolumeSnapshots tab
Actual results:
VolumeSnapshots data is not displayed in PVC > VolumeSnapshots tab
Expected results:
VolumeSnapshots data should be displayed in PVC > VolumeSnapshots tab
Additional info:
This is a clone of issue OCPBUGS-29932. The following is the description of the original issue:
—
Description of problem:
Sample job: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-azure-4.15-nightly-x86-data-path-9nodes/1760228008968327168
Version-Release number of selected component (if applicable):
How reproducible:
Anytime there is an error from the move-blobs command
Steps to Reproduce:
1. 2. 3.
Actual results:
An error message is shown
Expected results:
A panic is shown followed by the error message
Additional info:
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-30600.
Please review the following PR: https://github.com/openshift/cloud-provider-alibaba-cloud/pull/39
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Bring the downstream rukpak repo up-to-date with the v0.15.0 upstream release.
Description of problem:
While installing 3618 SNOs via ZTP using ACM 2.9, 15 clusters failed to complete install and have failed on the cluster-autoscaler operator. This represents the bulk of all cluster install failures in this testbed for OCP 4.14.0-rc.0.
# cat aci.InstallationFailed.autoscaler | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers "
vm00527 version False True 20h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00717 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00881 version False True 19h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm00998 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01006 version False True 17h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01059 version False True 15h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01155 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm01930 version False True 17h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02407 version False True 16h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm02651 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03073 version False True 19h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03258 version False True 20h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03295 version False True 14h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03303 version False True 15h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
vm03517 version False True 18h Unable to apply 4.14.0-rc.0: the cluster operator cluster-autoscaler is not available
Version-Release number of selected component (if applicable):
Hub 4.13.11
Deployed SNOs 4.14.0-rc.0
ACM 2.9 - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
15 out of 20 failures (75% of the failures); 15 out of 3618 total attempted SNO installs (~0.4% of all installs)
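The two rates quoted above follow directly from the counts in this report, as this short worked calculation shows (the variable names are illustrative):

```python
# Counts taken from this report.
failed_on_autoscaler = 15
total_failures = 20
total_installs = 3618

# Share of all failed installs, and share of all attempted installs, in percent.
share_of_failures = failed_on_autoscaler / total_failures * 100
share_of_installs = failed_on_autoscaler / total_installs * 100

print(f"{share_of_failures:.0f}% of failures, {share_of_installs:.2f}% of all installs")
```

This prints "75% of failures, 0.41% of all installs", consistent with the rounded figures given in the report.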
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
It appears that some show an error in the logs of the cluster-autoscaler-operator. Example:
I0912 19:54:39.962897 1 main.go:15] Go Version: go1.20.5 X:strictfipsruntime
I0912 19:54:39.962977 1 main.go:16] Go OS/Arch: linux/amd64
I0912 19:54:39.962982 1 main.go:17] Version: cluster-autoscaler-operator v4.14.0-202308301903.p0.gb57f5a9.assembly.stream-dirty
I0912 19:54:39.963137 1 leaderelection.go:122] The leader election gives 4 retries and allows for 30s of clock skew. The kube-apiserver downtime tolerance is 78s. Worst non-graceful lease acquisition is 2m43s. Worst graceful lease acquisition is {26s}.
I0912 19:54:39.975478 1 listener.go:44] controller-runtime/metrics "msg"="Metrics server is starting to listen" "addr"="127.0.0.1:9191"
I0912 19:54:39.976939 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-clusterautoscalers"
I0912 19:54:39.976984 1 server.go:187] controller-runtime/webhook "msg"="Registering webhook" "path"="/validate-machineautoscalers"
I0912 19:54:39.977082 1 main.go:41] Starting cluster-autoscaler-operator
I0912 19:54:39.977216 1 server.go:216] controller-runtime/webhook/webhooks "msg"="Starting webhook server"
I0912 19:54:39.977693 1 certwatcher.go:161] controller-runtime/certwatcher "msg"="Updated current TLS certificate"
I0912 19:54:39.977813 1 server.go:273] controller-runtime/webhook "msg"="Serving webhook server" "host"="" "port"=8443
I0912 19:54:39.977938 1 certwatcher.go:115] controller-runtime/certwatcher "msg"="Starting certificate watcher"
I0912 19:54:39.978008 1 server.go:50] "msg"="starting server" "addr"={"IP":"127.0.0.1","Port":9191,"Zone":""} "kind"="metrics" "path"="/metrics"
I0912 19:54:39.978052 1 leaderelection.go:245] attempting to acquire leader lease openshift-machine-api/cluster-autoscaler-operator-leader...
I0912 19:54:39.982052 1 leaderelection.go:255] successfully acquired lease openshift-machine-api/cluster-autoscaler-operator-leader
I0912 19:54:39.983412 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ClusterAutoscaler"
I0912 19:54:39.983462 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Deployment"
I0912 19:54:39.983483 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.Service"
I0912 19:54:39.983501 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.ServiceMonitor"
I0912 19:54:39.983520 1 controller.go:177] "msg"="Starting EventSource" "controller"="cluster_autoscaler_controller" "source"="kind source: *v1.PrometheusRule"
I0912 19:54:39.983532 1 controller.go:185] "msg"="Starting Controller" "controller"="cluster_autoscaler_controller"
I0912 19:54:39.986041 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *v1beta1.MachineAutoscaler"
I0912 19:54:39.986065 1 controller.go:177] "msg"="Starting EventSource" "controller"="machine_autoscaler_controller" "source"="kind source: *unstructured.Unstructured"
I0912 19:54:39.986072 1 controller.go:185] "msg"="Starting Controller" "controller"="machine_autoscaler_controller"
I0912 19:54:40.095808 1 webhookconfig.go:72] Webhook configuration status: created
I0912 19:54:40.101613 1 controller.go:219] "msg"="Starting workers" "controller"="cluster_autoscaler_controller" "worker count"=1
I0912 19:54:40.102857 1 controller.go:219] "msg"="Starting workers" "controller"="machine_autoscaler_controller" "worker count"=1
E0912 19:58:48.113290 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": net/http: TLS handshake timeout - error from a previous attempt: unexpected EOF
E0912 20:02:48.135610 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
E0913 13:49:02.118757 1 leaderelection.go:327] error retrieving resource lock openshift-machine-api/cluster-autoscaler-operator-leader: Get "https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused
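When triaging many failed clusters, it can help to pull only the error-severity entries out of operator logs like the one above. The following is a rough sketch, assuming the klog-style header seen in this log (severity letter, MMDD date, time, process ID, file:line, then "]"); the function name error_entries is hypothetical, and the sample contains two entries copied from the log.

```python
import re

# klog-style header: severity letter, MMDD, time, PID, file:line, closing bracket.
# The lookbehind requires the header to start the text or follow whitespace.
KLOG_HEADER = re.compile(
    r"(?<!\S)([IWEF])(\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ ([\w.]+):(\d+)\] "
)


def error_entries(log_text):
    """Return (source_file, message) for every error-severity (E) entry.

    Entries are delimited by their headers rather than by newlines, so the
    function also works on log text whose line breaks were lost.
    """
    matches = list(KLOG_HEADER.finditer(log_text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(log_text)
        if m.group(1) == "E":
            entries.append((m.group(4), log_text[m.end():end].strip()))
    return entries


sample = (
    'I0912 19:54:40.102857 1 controller.go:219] "msg"="Starting workers" '
    '"controller"="machine_autoscaler_controller" "worker count"=1 '
    'E0912 20:02:48.135610 1 leaderelection.go:327] error retrieving resource lock '
    'openshift-machine-api/cluster-autoscaler-operator-leader: Get '
    '"https://[fd02::1]:443/apis/coordination.k8s.io/v1/namespaces/openshift-machine-api/'
    'leases/cluster-autoscaler-operator-leader": dial tcp [fd02::1]:443: connect: connection refused'
)

errors = error_entries(sample)
for source_file, message in errors:
    print(source_file, "->", message)
```

Applied to the full log, this would surface the three leaderelection.go errors (one TLS handshake timeout and two refused connections) without the surrounding startup messages.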
Description of problem:
When running an Azure install, the installer noticeably hangs for a long time when running create manifests or create cluster. It will sit unresponsive for almost 2 minutes at:
DEBUG OpenShift Installer unreleased-master-9741-gbc9836aa9bd3a4f10d229bb6f87981dddf2adc92
DEBUG Built from commit bc9836aa9bd3a4f10d229bb6f87981dddf2adc92
DEBUG Fetching Metadata...
DEBUG Loading Metadata...
DEBUG Loading Cluster ID...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
INFO Credentials loaded from file "/root/.azure/osServicePrincipal.json"
This could also be related to failures we see in CI such as this: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_installer/8123/pull-ci-openshift-installer-master-e2e-azure-ovn/1773611162923962368
level=info msg=Consuming Worker Machines from target directory
level=info msg=Credentials loaded from file "/var/run/secrets/ci.openshift.io/cluster-profile/osServicePrincipal.json"
level=fatal msg=failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": error connecting to Azure client: failed to list SKUs: compute.ResourceSkusClient#List: Failure responding to request: StatusCode=200 -- Original Error: Error occurred reading http.Response#Body - Error = 'read tcp 10.128.117.2:43870->4.150.240.10:443: read: connection reset by peer'
If the call takes too long and the context is canceled when its timeout expires, we might see this error.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Run an Azure install.
Actual results:
Expected results:
Additional info:
https://github.com/openshift/installer/pull/8134 has a partial fix
Please review the following PR: https://github.com/openshift/vmware-vsphere-csi-driver-operator/pull/170
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25313. The following is the description of the original issue:
—
Description of problem:
Unable to view the alerts and metrics pages; a blank page is shown instead.
Version-Release number of selected component (if applicable):
4.15.0-nightly
How reproducible:
Always
Steps to Reproduce:
Click on any alert under "Notification Panel" to view more, and you will be redirected to the alert page.
Actual results:
The user is unable to view any alerts or metrics.
Expected results:
The user should be able to view all and individual alerts and metrics.
Additional info:
N.A
Story (Required)
As an ODC helm backend developer I would like to be able to bump version of helm to 3.13 to stay synched up with the version we will ship with OCP 4.15
Background (Required)
Normal activity we do every time a new OCP version is released, to stay current.
Glossary
NA
Out of scope
NA
Approach(Required)
Bump the version of Helm to 3.13, then run, build, and unit test to make sure everything is working as expected. Last time we had a conflict with the DevFile backend.
Dependencies
Might have dependencies on the DevFile team to move some dependencies forward.
OCPBUGS-18596 and OCPBUGS-22382 track issues on metal and vsphere jobs with disruption for image registry. By default image registry is not enabled for these platforms but is enabled, in a non-HA manner, for the tests. During discussion around the issue it was decided that unless / until these teams support HA deployments of image registry we should not be monitoring them for disruption.
Devan floated the idea of checking to see if the image registry deployment set has replicas enabled and if not then selectively disable disruption monitoring.
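The replica check floated above could look roughly like the following. This is only a sketch in Python, not the actual test code: the function name `should_monitor_disruption` and the threshold of two replicas for "HA" are assumptions made for illustration.

```python
from typing import Optional


def should_monitor_disruption(replicas: Optional[int], min_ha_replicas: int = 2) -> bool:
    """Return True only when the image registry looks highly available.

    `replicas` is the replica count read from the image registry deployment
    (None when the deployment does not exist). `min_ha_replicas` is an assumed
    threshold; it is not a value taken from the original discussion.
    """
    if replicas is None:
        return False
    return replicas >= min_ha_replicas
```

With this shape, a single-replica registry (the non-HA test setup described above) would simply be skipped by disruption monitoring rather than reported as a failure.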
Description of problem:
The RHDP-Developer/DXP team wants to deep-link some catalog pages with a filter on the Developer Sandbox cluster. The target page was shown without any query parameter when the user wasn't logged in.
Version-Release number of selected component (if applicable):
At least 4.13 (Dev Sandbox clusters run 4.13.13 currently.)
How reproducible:
Always when not logged in
Steps to Reproduce:
/catalog/ns/cjerolim-dev?catalogType=BuilderImage&keyword=.NET
Actual results:
The Developer Catalog is opened, but the catalog type "Build Images" and keyword filter ".NET" are not applied.
All Developer Catalog items are shown.
Expected results:
The Developer Catalog should open with the catalog type "Build Images" and the keyword filter ".NET" applied.
Exactly one catalog item should be shown.
Additional info:
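A minimal sketch of the expected behavior, assuming the login flow carries the requested target in a single redirect parameter. The `/auth/login` path and the `next` parameter name are hypothetical; the real console mechanism may differ.

```python
from urllib.parse import parse_qs, quote, urlsplit


def login_redirect(requested_url: str, login_path: str = "/auth/login") -> str:
    """Build a login URL that keeps the full requested path and query string."""
    parts = urlsplit(requested_url)
    target = parts.path
    if parts.query:
        target += "?" + parts.query
    # Percent-encode the whole target so its own "?" and "&" survive
    # as one parameter value instead of being parsed by the login page.
    return f"{login_path}?next={quote(target, safe='')}"


def target_after_login(login_url: str) -> str:
    """Recover the original target from the hypothetical `next` parameter."""
    return parse_qs(urlsplit(login_url).query)["next"][0]
```

Round-tripping the URL from the reproduction steps through these two helpers returns it unchanged, including `catalogType=BuilderImage` and `keyword=.NET`.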
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/85
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-23925. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/multus-cni/pull/202
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-27455. The following is the description of the original issue:
—
Problem Description:
Installed the Red Hat Quay Container Security Operator on the 4.13.25 cluster.
Below are my test results :
```
[sasakshi@sasakshi ~]$ oc version
Client Version: 4.12.7
Kustomize Version: v4.5.7
Server Version: 4.13.25
Kubernetes Version: v1.26.9+aa37255
[sasakshi@sasakshi ~]$ oc get csv -A | grep -i "quay" | tail -1
openshift container-security-operator.v3.10.2 Red Hat Quay Container Security Operator 3.10.2 container-security-operator.v3.10.1 Succeeded
[sasakshi@sasakshi ~]$ oc get subs -A
NAMESPACE NAME PACKAGE SOURCE CHANNEL
openshift-operators container-security-operator container-security-operator redhat-operators stable-3.10
[sasakshi@sasakshi ~]$ oc get imagemanifestvuln -A | wc -l
82
[sasakshi@sasakshi ~]$ oc get vuln --all-namespaces | wc -l
82
Console -> Administration -> Image Vulnerabilities : 82
Home -> Overview -> Status -> Image Vulnerabilities : 66
```
Observations from my testing:
Kindly refer to the attached screenshots for reference.
Documentation link referred:
Description of problem:
'404: Not Found' will show on Knative-serving Details page
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-13-223353
How reproducible:
Always
Steps to Reproduce:
1. Install the 'Serverless' Operator, make sure the operator has been installed successfully, and the Knative Serving instance is created without any error.
2. Navigate to Administration -> Cluster Settings -> Global Configuration.
3. Go to the Knative-serving Details page and check whether the 404 not found message is there.
Actual results:
Page will show 404 not found
Expected results:
the 404 not found page should not show
Additional info:
The dependency ticket is OCPBUGS-15008; more information can be found in its comments.
This is just a placeholder bug in 4.15.
the original bug ( https://issues.redhat.com/browse/OCPBUGS-20472 ) does not exist in 4.15 release.
===
Description of problem:
prow CI job: periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-nightly-4.14-upgrade-from-stable-4.13-aws-ipi-ovn-hypershift-replace-f7 failed in the step of upgrading the HCP image of the hosted cluster.
one failed job link: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/opens[…]-hypershift-replace-f7/1712338041915314176
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
* retrigger/rehearsal the job or
* create a 4.13 stable hosted cluster and upgrade it to 4.14 nightly manually
Actual results:
the upgrade failed using 4.14 nightly image for `hostedcluster`
Expected results:
The upgrade of the hostedcluster/nodepool succeeds.
Additional info:
we could get dump file from the job artifacts
Description of problem:
The OCP v4.14.1 installation is failing because the VIP is not being allocated to the bootstrap node.
Version-Release number of selected component (if applicable):
OCPv4.14.1
How reproducible:
100% --> https://access.redhat.com/support/cases/#/case/03668010
Steps to Reproduce:
1. 2. 3.
Actual results:
https://access.redhat.com/support/cases/#/case/03668010/discussion?commentId=a0a6R00000Vmdf3QAB
Expected results:
The OCP installation should end successfully.
Additional info:
The comment https://access.redhat.com/support/cases/#/case/03668010/discussion?commentId=a0a6R00000Vmdf3QAB describes the current state and the issue. If additional logs are required, I can arrange for this.
Seen in 4.15 update CI:
: [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
Run #0: Failed 1h16m1s
{ 1 unexpected clusteroperator state transitions during e2e test run
Nov 21 04:20:56.837 - 19s E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusK8SFailed status/False reconciling Prometheus Federate Route failed: retrieving Route object failed: etcdserver: leader changed}
While the Kube API server is supposed to buffer clients from etcd leader transitions, an issue that only persists for 19s is not long enough to warrant immediate admin intervention. Teaching the monitoring operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
A bunch of 4.15 jobs are impacted, almost all update jobs:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-kubevirt-conformance (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 17% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 38% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 52 runs, 15% failed, 175% of failures match = 27% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
Hit rates are low enough there that I haven't checked older 4.y. I'm not sure if all of those hits are UpdatingPrometheusK8SFailed or not, it seems likely that Kube API hiccups could impact a number of control loops. And there may be other triggers going on besides Kube API hiccups.
16% impact in periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade looks like the current largest impact percentage among the jobs with double-digit run counts.
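For reading the search output, the percentages appear to relate as: impact = matching runs / total runs, with "% of failures match" being matching runs / failed runs (which is how it can exceed 100%). The sketch below reproduces one summary line under that assumption; the raw counts (8 failed and 14 matching out of 52 runs) are inferred to fit the gcp-ovn-rt-upgrade line and are not taken from the search tool itself.

```python
def impact_summary(runs: int, failed: int, matching: int) -> str:
    """Format a summary line from raw counts.

    Assumed interpretation of the search output, not its actual source code:
    failed% = failed/runs, match% = matching/failed, impact% = matching/runs.
    """
    if runs <= 0 or failed <= 0:
        raise ValueError("runs and failed must be positive")
    failed_pct = round(100 * failed / runs)
    match_pct = round(100 * matching / failed)
    impact_pct = round(100 * matching / runs)
    return (
        f"{runs} runs, {failed_pct}% failed, "
        f"{match_pct}% of failures match = {impact_pct}% impact"
    )


# Inferred counts for the gcp-ovn-rt-upgrade job.
print(impact_summary(52, 8, 14))
```

Under this reading, multiplying the two rounded percentages only approximates the impact figure; the impact is better thought of as a direct ratio of matching runs to all runs.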
Run periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade or another job with a combination of high-ish impact percentage and high run counts, watching the monitoring ClusterOperator's Available condition.
Blips of Available=False that resolve more quickly than a responding admin could be expected to show up.
Only going Available=False when it seems reasonable to summon an emergency admin response.
I have no problem if folks decide to push for Kube API server / etcd perfection, but that seems like a hard goal to reach reliably in the mess of the real world, so even if you do push those folks for improvements, I think it makes sense to relax your response to those kinds of issues to only complain when things like Route object retrieval failures go on for long enough for the operator to be seriously
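The relaxed behavior asked for here amounts to a grace period on the Available condition. The following is a minimal sketch in Python rather than the operator's Go code; the class name `AvailabilityTracker` and the 120-second threshold are assumptions, not values from the monitoring operator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AvailabilityTracker:
    """Report Available=False only after failures persist past a grace period."""

    grace_seconds: float = 120.0  # assumed threshold, chosen for illustration
    failing_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one reconcile result and return the Available status to report."""
        if healthy:
            # Any success clears the failure window.
            self.failing_since = None
            return True
        if self.failing_since is None:
            self.failing_since = now
        # Stay Available=True while the failure is still within the grace period.
        return (now - self.failing_since) < self.grace_seconds
```

With such a tracker, a 19-second blip like the one in the test output would never flip the condition, while a failure that outlasts the grace period still goes Available=False.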
After https://github.com/openshift/console/pull/13102 got merged, it isn't possible to start the local console bridge anymore.
The UI crashes with this error:
Uncaught TypeError: Failed to construct 'URL': Invalid URL
    at ./public/module/auth.js (main-c115e44b78283c32bc69.js:81514:7)
    at __webpack_require__ (runtime~main-bundle.js:90:30)
The loginErrorURL is a string that couldn't get parsed with new URL:
window.SERVER_FLAGS.loginErrorURL
'/auth/error'
new URL(window.SERVER_FLAGS.loginErrorURL)
VM55:1 Uncaught TypeError: Failed to construct 'URL': Invalid URL
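The underlying problem is that '/auth/error' is a relative reference, and the browser's URL constructor only accepts it when a base URL is supplied. The same idea, sketched in Python rather than the console's JavaScript: resolve the value against a base unless it is already absolute. The helper name `resolve_url` and the example hostnames are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def resolve_url(value: str, base: str) -> str:
    """Return `value` unchanged if it is absolute, else resolve it against `base`."""
    parts = urlsplit(value)
    if parts.scheme and parts.netloc:
        return value
    # A path such as '/auth/error' replaces the path of the base URL.
    return urljoin(base, value)


# Hypothetical base; in the browser this role is played by the page origin.
print(resolve_url("/auth/error", "https://console.example.com/dashboards"))
```

A fix along these lines would keep working both for the local bridge, where the flag is a bare path, and for deployments where it is a full URL.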
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Some IBM jobs using openshift-test run failed due to recent monitor refactor. They request command options to disable monitor tests in openshift-test run. This is already implemented in openshift-test run-monitor.
IBM-Roks needs this, will link to slack thread
spec:
  configuration:
    featureGate:
      featureSet: TechPreviewNoUpgrade
$ oc get pod
NAME READY STATUS RESTARTS AGE
capi-provider-bd4858c47-sf5d5 0/2 Init:0/1 0 9m33s
cluster-api-85f69c8484-5n9ql 1/1 Running 0 9m33s
control-plane-operator-78c9478584-xnjmd 2/2 Running 0 9m33s
etcd-0 3/3 Running 0 9m10s
kube-apiserver-55bb575754-g4694 4/5 CrashLoopBackOff 6 (81s ago) 8m30s
$ oc logs kube-apiserver-55bb575754-g4694 -c kube-apiserver --tail=5
E0105 16:49:54.411837 1 controller.go:145] while syncing ConfigMap "kube-system/kube-apiserver-legacy-service-account-token-tracking", err: namespaces "kube-system" not found
I0105 16:49:54.415074 1 trace.go:236] Trace[236726897]: "Create" accept:application/vnd.kubernetes.protobuf, */*,audit-id:71496035-d1fe-4ee1-bc12-3b24022ea39c,client:::1,api-group:scheduling.k8s.io,api-version:v1,name:,subresource:,namespace:,protocol:HTTP/2.0,resource:priorityclasses,scope:resource,url:/apis/scheduling.k8s.io/v1/priorityclasses,user-agent:kube-apiserver/v1.29.0 (linux/amd64) kubernetes/9368fcd,verb:POST (05-Jan-2024 16:49:44.413) (total time: 10001ms):
Trace[236726897]: ---"Write to database call failed" len:174,err:priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request 10001ms (16:49:54.415)
Trace[236726897]: [10.001615835s] [10.001615835s] END
F0105 16:49:54.415382 1 hooks.go:203] PostStartHook "scheduling/bootstrap-system-priority-classes" failed: unable to add default system priority classes: priorityclasses.scheduling.k8s.io "system-node-critical" is forbidden: not yet ready to handle request
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/57
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/aws-pod-identity-webhook/pull/167
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
If CCM is disabled on a cloud platform such as AWS, the installation continues until it fails with ingress LoadBalancerPending.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Build an image with PRs openshift/cluster-cloud-controller-manager-operator#284, openshift/installer#7546, openshift/cluster-version-operator#979, openshift/machine-config-operator#3999.
2. Install a cluster on AWS with "baselineCapabilitySet: v4.14".
Actual results:
Installation failed, ingress LoadBalancerPending.
$ oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-25-230.us-east-2.compute.internal Ready control-plane,master 86m v1.28.3+20a5764
ip-10-0-3-101.us-east-2.compute.internal Ready worker 78m v1.28.3+20a5764
ip-10-0-46-198.us-east-2.compute.internal Ready control-plane,master 87m v1.28.3+20a5764
ip-10-0-48-220.us-east-2.compute.internal Ready worker 80m v1.28.3+20a5764
ip-10-0-79-203.us-east-2.compute.internal Ready control-plane,master 86m v1.28.3+20a5764
ip-10-0-95-83.us-east-2.compute.internal Ready worker 78m v1.28.3+20a5764
$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest False False True 85m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com/healthz": dial tcp: lookup oauth-openshift.apps.zhsun-aws1.qe.devcluster.openshift.com on 172.30.0.10:53: no such host (this is likely result of malfunctioning DNS server)
baremetal 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
cloud-credential 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 86m
cluster-autoscaler 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
config-operator 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m
console 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest False True False 79m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 81m
csi-snapshot-controller 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m
dns 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
etcd 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 83m
image-registry 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m
ingress False True True 78m The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (LoadBalancerPending: The LoadBalancer service is pending)
insights 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m
kube-apiserver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 71m
kube-controller-manager 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 82m
kube-scheduler 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 82m
kube-storage-version-migrator 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m
machine-api 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 77m
machine-approver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
machine-config 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
marketplace 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 84m
monitoring 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 73m
network 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 86m
node-tuning 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m
openshift-apiserver 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 71m
openshift-controller-manager 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 75m
openshift-samples 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 78m
service-ca 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False False 85m
storage 4.15.0-0.ci.test-2023-11-22-202439-ci-ln-d46q8qt-latest True False
Expected results:
Tell users not to turn CCM off for cloud.
Additional info:
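One way to "tell users" early would be an install-time validation that rejects the combination before any infrastructure is created. The sketch below only illustrates that idea in Python; it is not installer code. The platform set is an assumed, incomplete list, and the capability name `CloudControllerManager` is taken to be the name of the optional capability being disabled here.

```python
from typing import Iterable, Optional

# Assumed, illustrative subset of platforms that need an external cloud controller.
CLOUD_PLATFORMS = {"aws", "azure", "gcp"}
CCM_CAPABILITY = "CloudControllerManager"


def validate_ccm_capability(platform: str, enabled_capabilities: Iterable[str]) -> Optional[str]:
    """Return an error message if CCM is disabled on a cloud platform, else None."""
    enabled = set(enabled_capabilities)
    if platform.lower() in CLOUD_PLATFORMS and CCM_CAPABILITY not in enabled:
        return (
            f"platform {platform!r} requires the {CCM_CAPABILITY} capability; "
            "without it the ingress LoadBalancer service stays pending"
        )
    return None
```

Failing fast with a message like this would avoid the long install that only surfaces the problem as LoadBalancerPending.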
This is a clone of issue OCPBUGS-29355. The following is the description of the original issue:
—
When clicking on the output image link on a Shipwright BuildRun details page, the link leads to the imagestream details page but shows 404 error.
The image link is:
https://console-openshift-console.apps...openshiftapps.com/k8s/ns/buildah-example/imagestreams/sample-kotlin-spring%3A1.0-shipwright
The BuildRun spec
apiVersion: shipwright.io/v1beta1
kind: BuildRun
metadata:
  generateName: sample-spring-kotlin-build-
  name: sample-spring-kotlin-build-xh2dq
  namespace: buildah-example
  labels:
    build.shipwright.io/generation: '2'
    build.shipwright.io/name: sample-spring-kotlin-build
spec:
  build:
    name: sample-spring-kotlin-build
status:
  buildSpec:
    output:
      image: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    paramValues:
      - name: run-image
        value: 'paketocommunity/run-ubi-base:latest'
      - name: cnb-builder-image
        value: 'paketobuildpacks/builder-jammy-tiny:0.0.176'
      - name: app-image
        value: 'image-registry.openshift-image-registry.svc:5000/buildah-example/sample-kotlin-spring:1.0-shipwright'
    source:
      git:
        url: 'https://github.com/piomin/sample-spring-kotlin-microservice.git'
      type: Git
    strategy:
      kind: ClusterBuildStrategy
      name: buildpacks
  completionTime: '2024-02-12T12:15:03Z'
  conditions:
    - lastTransitionTime: '2024-02-12T12:15:03Z'
      message: All Steps have completed executing
      reason: Succeeded
      status: 'True'
      type: Succeeded
  output:
    digest: 'sha256:dc3d44bd4d43445099ab92bbfafc43d37e19cfaf1cac48ae91dca2f4ec37534e'
  source:
    git:
      branchName: master
      commitAuthor: Piotr Mińkowski
      commitSha: aeb03d60a104161d6fd080267bf25c89c7067f61
  startTime: '2024-02-12T12:13:21Z'
  taskRunName: sample-spring-kotlin-build-xh2dq-j47ql
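The broken link embeds the tag (`%3A1.0-shipwright`) in the imagestream name. A possible approach, sketched in Python rather than the console's own code, is to split the output image reference into its parts and link to the imagestream by name only. The helper names are hypothetical, the parser assumes a `registry/namespace/name[:tag]` reference, and it does not handle digests.

```python
from typing import NamedTuple, Optional


class ImageRef(NamedTuple):
    registry: str
    namespace: str
    name: str
    tag: Optional[str]


def parse_image_ref(image: str) -> ImageRef:
    """Split 'registry/namespace/name[:tag]' into its components."""
    # Split on the first two slashes only, so the registry port stays with the host.
    registry, namespace, last = image.split("/", 2)
    name, sep, tag = last.partition(":")
    return ImageRef(registry, namespace, name, tag if sep else None)


def imagestream_path(image: str) -> str:
    """Build the console path of the imagestream, leaving the tag out."""
    ref = parse_image_ref(image)
    return f"/k8s/ns/{ref.namespace}/imagestreams/{ref.name}"
```

For the output image in the BuildRun above, this yields a path ending in `/imagestreams/sample-kotlin-spring`, with `1.0-shipwright` available separately as the tag.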
Causing payload rejection now.
https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/304/files caused it
From Trevor:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1701485079971669
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27231/pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6/1730718245385670656
: [sig-arch] events should not repeat pathologically for ns/openshift-cloud-controller-manager-operator 0s
{ 1 events happened too frequently
event happened 374 times, something is wrong: namespace/openshift-cloud-controller-manager-operator node/master-1.ostest.test.metalkube.org pod/cluster-cloud-controller-manager-operator-5b6b87b648-rzdbc hmsg/873af7a9ec - reason/BackOff Back-off pulling image "quay.io/openshift/origin-kube-rbac-proxy:4.2.0" From: 00:53:59Z To: 00:54:00Z result=reject }
4.2 is an old-sounding tag? Seems like not-a-flake, but still gathering data
This is a clone of issue OCPBUGS-24226. The following is the description of the original issue:
—
Maxim Patlasov pointed this out in STOR-1453 but still somehow we missed it. I tested this on 4.15.0-0.ci-2023-11-29-021749.
It is possible to set a custom TLSSecurityProfile without minTLSVersion:
$ oc edit apiserver cluster
...
spec:
  tlsSecurityProfile:
    type: Custom
    custom:
      ciphers:
      - ECDHE-ECDSA-CHACHA20-POLY1305
      - ECDHE-ECDSA-AES128-GCM-SHA256
This causes the controller to crash loop:
$ oc get pods -n openshift-cluster-csi-drivers
NAME READY STATUS RESTARTS AGE
aws-ebs-csi-driver-controller-589c44468b-gjrs2 6/11 CrashLoopBackOff 10 (18s ago) 37s
...
because the `${TLS_MIN_VERSION}` placeholder is never replaced:
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
- --tls-min-version=${TLS_MIN_VERSION}
The observed config in the ClusterCSIDriver shows an empty string:
$ oc get clustercsidriver ebs.csi.aws.com -o json | jq .spec.observedConfig
{
"targetcsiconfig": {
"servingInfo":
}
}
which means minTLSVersion is empty when we get to this line, and the string replacement is not done:
So it seems we have a couple of options:
1) completely omit the --tls-min-version arg if minTLSVersion is empty, or
2) set --tls-min-version to the same default value we would use if TLSSecurityProfile is not present in the apiserver object
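Both options can be sketched with one helper. This is an illustration in Python, not the operator's Go code: the function name `render_tls_args`, the `--v=2` argument, and the use of `VersionTLS12` as the fallback are assumptions made for the example.

```python
from typing import List, Optional

PLACEHOLDER = "${TLS_MIN_VERSION}"


def render_tls_args(args: List[str], min_tls_version: str,
                    default: Optional[str] = None) -> List[str]:
    """Substitute the TLS placeholder, never leaving it unreplaced.

    Option 1: with no `default`, arguments containing the placeholder are
    dropped when `min_tls_version` is empty.
    Option 2: with a `default`, the empty value falls back to that default.
    """
    version = min_tls_version or default
    rendered = []
    for arg in args:
        if PLACEHOLDER not in arg:
            rendered.append(arg)
        elif version:
            rendered.append(arg.replace(PLACEHOLDER, version))
        # Otherwise skip the argument instead of passing the raw placeholder.
    return rendered


args = ["--v=2", "--tls-min-version=${TLS_MIN_VERSION}"]
print(render_tls_args(args, ""))                          # option 1
print(render_tls_args(args, "", default="VersionTLS12"))  # option 2
```

Either way the containers would no longer be started with the literal `${TLS_MIN_VERSION}` string, which is what causes the crash loop described above.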
This is a clone of issue OCPBUGS-25372. The following is the description of the original issue:
—
Description of problem:
Found in QE's CI (with the vsphere-agent profile): the storage CO is not available and the vsphere-problem-detector-operator pod is in CrashLoopBackOff with a panic.
(Find the must-gather here: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-vsphere-agent-disconnected-ha-f14/1734850632575094784/artifacts/vsphere-agent-disconnected-ha-f14/gather-must-gather/)
The storage CO reports "unable to find VM by UUID":
- lastTransitionTime: "2023-12-13T09:15:27Z"
  message: "VSphereCSIDriverOperatorCRAvailable: VMwareVSphereControllerAvailable: unable to find VM ci-op-782gwsbd-b3d4e-master-2 by UUID \nVSphereProblemDetectorDeploymentControllerAvailable: Waiting for Deployment"
  reason: VSphereCSIDriverOperatorCR_VMwareVSphereController_vcenter_api_error::VSphereProblemDetectorDeploymentController_Deploying
  status: "False"
  type: Available
(But I did not see the "unable to find VM by UUID" message in the vsphere-problem-detector-operator log in the must-gather.)
The vsphere-problem-detector-operator log:
2023-12-13T10:10:56.620216117Z I1213 10:10:56.620159 1 vsphere_check.go:149] Connected to vcenter.devqe.ibmc.devcluster.openshift.com as ci_user_01@devqe.ibmc.devcluster.openshift.com
2023-12-13T10:10:56.625161719Z I1213 10:10:56.625108 1 vsphere_check.go:271] CountVolumeTypes passed
2023-12-13T10:10:56.625291631Z I1213 10:10:56.625258 1 zones.go:124] Checking tags for multi-zone support.
2023-12-13T10:10:56.625449771Z I1213 10:10:56.625433 1 zones.go:202] No FailureDomains configured. Skipping check.
2023-12-13T10:10:56.625497726Z I1213 10:10:56.625487 1 vsphere_check.go:271] CheckZoneTags passed
2023-12-13T10:10:56.625531795Z I1213 10:10:56.625522 1 info.go:44] vCenter version is 8.0.2, apiVersion is 8.0.2.0 and build is 22617221
2023-12-13T10:10:56.625562833Z I1213 10:10:56.625555 1 vsphere_check.go:271] ClusterInfo passed
2023-12-13T10:10:56.625603236Z I1213 10:10:56.625594 1 datastore.go:312] checking datastore /DEVQEdatacenter/datastore/vsanDatastore for permissions
2023-12-13T10:10:56.669205822Z panic: runtime error: invalid memory address or nil pointer dereference
2023-12-13T10:10:56.669338411Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x23096cb]
2023-12-13T10:10:56.669565413Z
2023-12-13T10:10:56.669591144Z goroutine 550 [running]:
2023-12-13T10:10:56.669838383Z github.com/openshift/vsphere-problem-detector/pkg/operator.getVM(0xc0005da6c0, 0xc0002d3b80)
2023-12-13T10:10:56.669991749Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:319 +0x3eb
2023-12-13T10:10:56.670212441Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*vSphereChecker).enqueueSingleNodeChecks.func1()
2023-12-13T10:10:56.670289644Z github.com/openshift/vsphere-problem-detector/pkg/operator/vsphere_check.go:238 +0x55
2023-12-13T10:10:56.670490453Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker.func1(0xc000c88760?, 0x0?)
2023-12-13T10:10:56.670702592Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:40 +0x55
2023-12-13T10:10:56.671142070Z github.com/openshift/vsphere-problem-detector/pkg/operator.(*CheckThreadPool).worker(0xc000c78660, 0xc000c887a0?)
2023-12-13T10:10:56.671331852Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:41 +0xe7
2023-12-13T10:10:56.671529761Z github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool.func1()
2023-12-13T10:10:56.671589925Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:28 +0x25
2023-12-13T10:10:56.671776328Z created by github.com/openshift/vsphere-problem-detector/pkg/operator.NewCheckThreadPool
2023-12-13T10:10:56.671847478Z github.com/openshift/vsphere-problem-detector/pkg/operator/pool.go:27 +0x73
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-11-033133
How reproducible:
Steps to Reproduce:
1. See description.
Actual results:
vpd panics
Expected results:
vpd should not panic
Additional info:
I guess it is a privileges issue, but our pod should not panic.
Description of problem:
Cluster install failed on IBM Cloud and machine-api-controllers is stuck in CrashLoopBackOff
Version-Release number of selected component (if applicable):
from 4.16.0-0.nightly-2024-02-02-224339
How reproducible:
Always
Steps to Reproduce:
1. Install a cluster on IBMCloud.
Actual results:
Cluster install failed
$ oc get node
NAME STATUS ROLES AGE VERSION
maxu-16-gp2vp-master-0 Ready control-plane,master 7h11m v1.29.1+2f773e8
maxu-16-gp2vp-master-1 Ready control-plane,master 7h11m v1.29.1+2f773e8
maxu-16-gp2vp-master-2 Ready control-plane,master 7h11m v1.29.1+2f773e8
$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
maxu-16-gp2vp-master-0 7h15m
maxu-16-gp2vp-master-1 7h15m
maxu-16-gp2vp-master-2 7h15m
maxu-16-gp2vp-worker-1-xfvqq 7h5m
maxu-16-gp2vp-worker-2-5hn7c 7h5m
maxu-16-gp2vp-worker-3-z74z2 7h5m
openshift-machine-api machine-api-controllers-6cb7fcdcdb-k6sv2 6/7 CrashLoopBackOff 92 (31s ago) 7h1m
$ oc logs -n openshift-machine-api -c machine-controller machine-api-controllers-6cb7fcdcdb-k6sv2
I0204 10:53:34.336338 1 main.go:120] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x285fe72]
goroutine 25 [running]:
k8s.io/klog/v2/textlogger.(*tlogger).Enabled(0x0?, 0x0?)
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/k8s.io/klog/v2/textlogger/textlogger.go:81 +0x12
sigs.k8s.io/controller-runtime/pkg/log.(*delegatingLogSink).Enabled(0xc000438100, 0x0?)
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/log/deleg.go:114 +0x92
github.com/go-logr/logr.Logger.Info({{0x3232210?, 0xc000438100?}, 0x0?}, {0x2ec78f3, 0x17}, {0x0, 0x0, 0x0})
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/github.com/go-logr/logr/logr.go:276 +0x72
sigs.k8s.io/controller-runtime/pkg/metrics/server.(*defaultServer).Start(0xc0003bd2c0, {0x322e350?, 0xc00058a140})
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/metrics/server/server.go:185 +0x75
sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile.func1(0xc0002c4540)
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:223 +0xc8
created by sigs.k8s.io/controller-runtime/pkg/manager.(*runnableGroup).reconcile in goroutine 24
    /go/src/github.com/openshift/machine-api-provider-ibmcloud/vendor/sigs.k8s.io/controller-runtime/pkg/manager/runnable_group.go:207 +0x19d
Expected results:
Cluster install succeeds
Additional info:
May be related to this PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/34
Description of problem:
When creating deployments/deployment-configs and associated Shipwright builds, the decorators associated with the node in Topology are not visible
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install the pipeline and shipwright operators
2. Create a deployment with build runs
3. Run the cluster in the local setup
4. Go to Topology where the deployments are created
Actual results:
No decorator visible
Expected results:
Decorators should be visible
Additional info:
Description of problem:
Attempting to perform a GCP XPN internal cluster installation, the install fails when the master nodes are added to a second [internal] instance group (k8s-ig-xxxx).
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. The following install config was used:
additionalTrustBundlePolicy: Proxyonly
apiVersion: v1
baseDomain: installer.gcp.devcluster.openshift.com
credentialsMode: Passthrough
featureSet: TechPreviewNoUpgrade
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
metadata:
  creationTimestamp: null
  name: bbarbach-xpn
networking:
  clusterNetwork:
  - cidr: 10.124.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.128.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: openshift-installer-shared-vpc
    region: us-central1
    network: bbarbach-internal-vpc
    computeSubnet: bbarbach-internal-vpc
    controlPlaneSubnet: bbarbach-internal-vpc
    networkProjectID: openshift-dev-installer
publish: Internal
2. This is a shared VPC install so the service and host projects need to be used in the install-config above.
3. Set the release image to 4.13-nightly
4. openshift-install create cluster --log-level=DEBUG
Actual results:
ERROR
ERROR Error: Error waiting for Updating RegionBackendService: Validation failed for instance 'projects/openshift-installer-shared-vpc/zones/us-central1-a/instances/bbarbach-xpn-4t8zl-master-0': instance may belong to at most one load-balanced instance group.
ERROR
ERROR
ERROR with google_compute_region_backend_service.api_internal,
ERROR on main.tf line 13, in resource "google_compute_region_backend_service" "api_internal":
ERROR 13: resource "google_compute_region_backend_service" "api_internal" {
ERROR
FATAL failed disabling bootstrap load balancing: failed to apply Terraform: exit status 1
FATAL
FATAL Error: Error waiting for Updating RegionBackendService: Validation failed for instance 'projects/openshift-installer-shared-vpc/zones/us-central1-a/instances/bbarbach-xpn-4t8zl-master-0': instance may belong to at most one load-balanced instance group.
FATAL
FATAL
FATAL with google_compute_region_backend_service.api_internal,
FATAL on main.tf line 13, in resource "google_compute_region_backend_service" "api_internal":
FATAL 13: resource "google_compute_region_backend_service" "api_internal" {
FATAL
FATAL
Expected results:
Successful install
Additional info:
The normal GCP internal cluster installation succeeds. Checking the instance groups, the internal cluster creates the k8s-ig-xxxx instance groups where the workers are added to each respective group. The masters are NOT added to the instance groups. The failure during the xpn install occurs because these masters are added to the instance groups.
Description of problem:
When I execute the following two tag commands in a row on OCP 4.14.0-ec.3, Multi-Arch:
oc tag $IMAGE@$DIGEST_MANIFEST test-1:tag-manifest
sleep 0
oc tag $IMAGE@$DIGEST_MANIFEST test-1:tag-manifest-preserve-original --import-mode=PreserveOriginal
then wrong data is written to the .image.dockerImageMetadata record. If there is a delay between these two commands, e.g. sleep 5, then the image.dockerImageMetadata contains correct data.
Version-Release number of selected component (if applicable):
How reproducible:
Run the script below to see the error. If you change it to SLEEP_TIME=5, the script passes.
Steps to Reproduce:
#!/usr/bin/env bash
set -e
SLEEP_TIME=0 # Test will fail, when sleep time is 0, use delay of 3 sec or more to pass this test
IMAGE="quay.io/podman/hello"
podman pull $IMAGE:latest
DIGEST_MANIFEST=$(podman inspect quay.io/podman/hello:latest | jq -r '.[0].Digest')
oc new-project "ir-test-001"
oc create imagestream test-1
oc import-image test-1 --from="${IMAGE}@${DIGEST_MANIFEST}" --import-mode='PreserveOriginal'
oc tag $IMAGE@$DIGEST_MANIFEST test-1:tag-manifest
sleep "${SLEEP_TIME}"
oc tag $IMAGE@$DIGEST_MANIFEST test-1:tag-manifest-preserve-original --import-mode=PreserveOriginal
sleep 5
[[ $(oc get istag test-1:tag-manifest-preserve-original -o json | jq -r '.image.dockerImageMetadata.Architecture') == "null" ]] && echo "pass: tag-manifest-preserve-original has no architecture" || echo "fail: tag-preserve-original has architecture and should not"
Actual results:
fail: tag-preserve-original has architecture and should not
oc get istag test-1:tag-manifest-preserve-original -o json | jq -r '.image.dockerImageMetadata.Architecture'
amd64
Expected results:
pass: tag-manifest-preserve-original has no architecture
oc get istag test-1:tag-manifest-preserve-original -o json | jq -r '.image.dockerImageMetadata.Architecture'
null
Additional info:
This was tested with OC command on x86_64
Please review the following PR: https://github.com/openshift/cluster-api-provider-azure/pull/282
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description:
High-volume PipelineRun/TaskRun logs do not auto-scroll to the bottom of the page.
Steps to reproduce:
1. Create a PipelineRun that produces high-volume log output
2. Navigate to the logs page
Video - https://drive.google.com/file/d/17Dc0ME6KYtkyQmW96lT8J_tMfT-dBRbb/view?usp=drive_link
Description of problem:
Due to the way that the termination handler unit tests are configured, it is possible in some cases for the counter of HTTP requests to the mock handler to cause the test to deadlock and time out. This happens randomly, as the ordering of the tests affects when the bug occurs.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
It happens randomly when run in CI, or when the full suite is run. But if the tests are focused, it happens every time. Focusing on "poll URL cannot be reached" will trigger the bug in the unit test.
Steps to Reproduce:
1. add `-focus "poll URL cannot be reached"` to unit test ginkgo arguments
2. run `make unit`
Actual results:
test suite hangs after this output: "Handler Suite when running the handler when polling the termination endpoint and the poll URL cannot be reached should return an error /home/mike/dev/machine-api-provider-aws/pkg/termination/handler_test.go:197"
Expected results:
Tests pass
Additional info:
To fix this we need to isolate the test in its own context block; this patch should do the trick:
diff --git a/pkg/termination/handler_test.go b/pkg/termination/handler_test.go
index 2b98b08b..0f85feae 100644
--- a/pkg/termination/handler_test.go
+++ b/pkg/termination/handler_test.go
@@ -187,7 +187,9 @@ var _ = Describe("Handler Suite", func() {
 Consistently(nodeMarkedForDeletion(testNode.Name)).Should(BeFalse())
 })
 })
+ })
+ Context("when the termination endpoint is not valid", func() {
 Context("and the poll URL cannot be reached", func() {
 BeforeEach(func() {
 nonReachable := "abc#1://localhost"
OCP 4.14
Logging 5.8
Always
The user is redirected to Observe -> metrics, and the chart does not display any metrics as they are not stored in prometheus
The user should be redirected to Observe -> Logs, and the metric should be displayed instead of the log list: see OU-267
This is a clone of issue OCPBUGS-27366. The following is the description of the original issue:
—
To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed. This requires us to change rendering so that it honors the clusterprofile.
Then we have to update the installer to match, then update hypershift, then update the manifests.
This is a clone of issue OCPBUGS-29363. The following is the description of the original issue:
—
Description of problem:
1. TaskRuns list page is loading constantly for all projects
2. Archive icon is not displayed for some tasks in TaskRun list page
3. On change of ns to All Projects, PipelineRuns and TaskRuns are not loading properly
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1. Create some TaskRuns
2. Go to the TaskRun list page
3. Select "All Projects" in the project dropdown
Actual results:
The screen keeps loading
Expected results:
Should load TaskRuns from all projects
Additional info:
Description of problem:
When configured with a single identity provider that is not capable of login authentication flows, the oauth-server returns an error when accessed from the browser. When the oauth-server is accessed from the web console, this error causes a redirect loop between the oauth-server and the console.
Version-Release number of selected component (if applicable):
4.5
How reproducible:
100%
Steps to Reproduce:
1. configure request header IdP with some bogus ChallengeURL and no LoginURL
2. disable the kubeadmin user by deleting the kube-system/kubeadmin secret
3. wait for the changes to be applied to the oauth-server's deployment
4. go to the console's URL
Actual results:
The console tries to access a resource, gets "unauthorized" error, redirects user to the oauth-server, the oauth-server errors out because it does not allow browser login, redirects user to console, and the loop repeats infinitely.
Expected results:
The oauth-server presents the user with a login page that won't allow them to log in OR the server errors out with a clear error that tells the console not to try to loop back to it again.
Because the installer generates some of the keys that will remain present in the cluster (e.g. the signing key for the admin kubeconfig), it should also run in an environment where FIPS is enabled.
Because it is very easy to fail to notice that the keys were generated in a non-FIPS-certified environment, we should enforce this by checking that fips_enabled is true if the target cluster is to have FIPS enabled.
Colin Walters has a patch for this.
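The check described above could look roughly like the sketch below. It assumes the host flag is read from the Linux sysctl file /proc/sys/crypto/fips_enabled; the function names and the decision to treat a missing file as "not enabled" are illustrative assumptions, not the installer's actual code or the referenced patch.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
)

// fipsSysctlPath is where Linux reports kernel FIPS mode. Using this
// path as the source of truth is an assumption of this sketch.
const fipsSysctlPath = "/proc/sys/crypto/fips_enabled"

// parseFIPSFlag reports whether the file content signals FIPS mode ("1").
func parseFIPSFlag(content []byte) bool {
	return string(bytes.TrimSpace(content)) == "1"
}

// hostFIPSEnabled reads the flag; a missing file is treated as "not enabled".
func hostFIPSEnabled(path string) (bool, error) {
	content, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return parseFIPSFlag(content), nil
}

// validateFIPS rejects generating keys for a FIPS cluster on a non-FIPS host.
func validateFIPS(clusterFIPS, hostFIPS bool) error {
	if clusterFIPS && !hostFIPS {
		return errors.New("target cluster has FIPS enabled but this host does not")
	}
	return nil
}

func main() {
	host, err := hostFIPSEnabled(fipsSysctlPath)
	if err != nil {
		fmt.Println("could not determine host FIPS mode:", err)
		return
	}
	fmt.Println("host FIPS enabled:", host)
	fmt.Println("validation for a FIPS cluster:", validateFIPS(true, host))
}
```

Keeping the parsing and the policy decision in separate functions makes the rule easy to test without a FIPS-enabled machine.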
Please review the following PR: https://github.com/openshift/telemeter/pull/480
We need to fix and bump library-go for http2 vulnerability CVE-2023-44487. This effectively turns off HTTP/2 in library-go http endpoints, i.e. metrics and health.
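For reference, a Go net/http server can be kept on HTTP/1.1 by setting a non-nil TLSNextProto map and advertising only "http/1.1" via ALPN. The sketch below shows that general technique; `newHTTP1OnlyServer` is a hypothetical helper, and this is not necessarily how library-go implements the change.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
)

// newHTTP1OnlyServer returns a server that will not negotiate HTTP/2 over TLS.
func newHTTP1OnlyServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:    addr,
		Handler: handler,
		// Advertise only HTTP/1.1 via ALPN.
		TLSConfig: &tls.Config{NextProtos: []string{"http/1.1"}},
		// A non-nil (even empty) TLSNextProto map stops net/http from
		// enabling its built-in HTTP/2 support automatically.
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){},
	}
}

func main() {
	// The address is an arbitrary example; the server is not started here.
	srv := newHTTP1OnlyServer("127.0.0.1:8443", http.NotFoundHandler())
	fmt.Println("ALPN protocols:", srv.TLSConfig.NextProtos)
	fmt.Println("HTTP/2 auto-enable suppressed:", srv.TLSNextProto != nil)
}
```

A client such as `curl -kv` against a server built this way should no longer report that the server accepted h2 during ALPN.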
Please review the following PR: https://github.com/openshift/console-operator/pull/818
Description of the problem:
When gathering manifests for a cluster from assisted-installer using assisted-test-infra any 'system generated' manifests are not listed.
How reproducible:
Look at any triage ticket that has recently been created, you will notice that the `system-generated` manifests are missing.
Actual results:
Only user-generated manifests are shown by assisted-test-infra
Expected results:
System generated manifests as well as user generated manifests should be listed by assisted-test-infra
Opened in order to backport this work to 4.15: https://github.com/openshift/cluster-node-tuning-operator/pull/936
This is needed since the MixedCPUs feature is part of the 4.15 payload and we need the e2e test there as well, to make sure the feature is in good shape and no regression occurs.
The tests themselves would not affect the payload though.
Description of problem:
After adding additional CPU and Memory to the OpenShift Container Platform 4 - Control-Plane Node(s), a new MachineConfig was rolled out, causing all OpenShift Container Platform 4 - Node(s) to reboot unexpectedly. Interestingly, no new MachineConfig was rendered; instead, a slightly older MachineConfig was picked and applied to all OpenShift Container Platform 4 - Node(s) after the change on the Control-Plane Node(s) was performed. The only visible change found in the MachineConfig was that nodeStatusUpdateFrequency was updated from 10s to 0s, even though nodeStatusUpdateFrequency is not specified or configured in any MachineConfig or KubeletConfig. https://issues.redhat.com/browse/OCPBUGS-6723 was found, but given that the affected OpenShift Container Platform 4 - Cluster is running 4.11.35, it is difficult to understand what happened, as this problem was generally believed to be solved.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.11.35
How reproducible:
Unknown
Steps to Reproduce:
1. OpenShift Container Platform 4 on AWS
2. Updating OpenShift Container Platform 4 - Control-Plane Node(s) to add more CPU and Memory
3. Check whether a potential MachineConfig update is being applied
Actual results:
A MachineConfig update is being rolled out to all OpenShift Container Platform 4 - Node(s) after adding CPU and Memory to the OpenShift Container Platform 4 - Control-Plane Node(s) because nodeStatusUpdateFrequency is being updated. This is unexpected, and it is not clear why it is happening.
Expected results:
Either no new MachineConfig to rollout after such a change or else to have a newly rendered MachineConfig that is being rolled out with information of what changed and why this change was applied
Additional info:
Description of problem:
WebhookConfiguration caBundle injection is incorrect when some webhooks are already configured with caBundle. The behavior seems to be that the first n webhooks in the `.webhooks` array have caBundle injected, where n is the number of webhooks that do not have caBundle set.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Create a validatingwebhookconfigurations or mutatingwebhookconfigurations with `service.beta.openshift.io/inject-cabundle: "true"` annotation.
2. oc edit validatingwebhookconfigurations (or oc edit mutatingwebhookconfigurations)
3. Add a new webhook to the end of the list `.webhooks`. It will not have caBundle set manually as service-ca should inject it.
4. Observe new webhook does not get caBundle injected.
Note: it is important in step 3 that the new webhook is added to the end of the list.
Actual results:
Only the first n webhooks have caBundle injected where n is the number of webhooks without caBundle set.
Expected results:
All webhooks have caBundle injected when they do not have it set.
Additional info:
Open PR here: https://github.com/openshift/service-ca-operator/pull/207
The issue seems to be a mistake with the Go `for ... range` syntax, where "i" is the index of the desired "i" to update. TL;DR: the code should use the value of the int in the array, not the index of the int in the array.
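The index/value mix-up described above can be reproduced in isolation. The sketch below is an illustration of the pattern only: the `webhook` type and the function names are hypothetical and do not come from service-ca-operator.

```go
package main

import "fmt"

// webhook is a hypothetical stand-in for one entry in `.webhooks`.
type webhook struct {
	Name     string
	CABundle string
}

// missingCABundle returns the indices of webhooks with no caBundle set.
func missingCABundle(hooks []webhook) []int {
	var idx []int
	for i, h := range hooks {
		if h.CABundle == "" {
			idx = append(idx, i)
		}
	}
	return idx
}

// injectBuggy ranges over the index list but uses the loop position,
// so it always touches the first n webhooks.
func injectBuggy(hooks []webhook, ca string) {
	for i := range missingCABundle(hooks) {
		hooks[i].CABundle = ca
	}
}

// injectFixed uses the stored value, which is the real webhook index.
func injectFixed(hooks []webhook, ca string) {
	for _, i := range missingCABundle(hooks) {
		hooks[i].CABundle = ca
	}
}

func main() {
	buggy := []webhook{{Name: "existing", CABundle: "manual"}, {Name: "new"}}
	fixed := []webhook{{Name: "existing", CABundle: "manual"}, {Name: "new"}}
	injectBuggy(buggy, "injected")
	injectFixed(fixed, "injected")
	fmt.Printf("buggy: %+v\n", buggy)
	fmt.Printf("fixed: %+v\n", fixed)
}
```

With one pre-populated webhook followed by a new one, the buggy loop rewrites the first entry and leaves the new one empty, which matches the "first n webhooks" symptom in the report; the fixed loop leaves the first entry alone and fills in the new one.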
This is a clone of issue OCPBUGS-27465. The following is the description of the original issue:
—
Description of problem:
The test implementation in https://github.com/openshift/origin/commit/5487414d8f5652c301a00617ee18e5ca8f339cb4#L56 assumes there is just one kubelet service, or at least that it is always the first one in the MCP. This just changed in https://github.com/openshift/machine-config-operator/pull/4124, and the test is now failing.
Version-Release number of selected component (if applicable):
master branch of 4.16
How reproducible:
always during test
Steps to Reproduce:
1. Test with https://github.com/openshift/machine-config-operator/pull/4124 applied
Actual results:
Test detects a wrong service and fails
Expected results:
Test finds the proper kubelet.service and passes
Additional info:
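A lookup by unit name instead of by position avoids this kind of ordering assumption. The following is a minimal sketch with a hypothetical, simplified `unit` type; it is not the origin test code, and the first unit name in the example is made up.

```go
package main

import "fmt"

// unit is a hypothetical, simplified systemd unit entry from a rendered config.
type unit struct {
	Name     string
	Contents string
}

// findUnit returns the first unit with the given name, regardless of position.
func findUnit(units []unit, name string) (unit, bool) {
	for _, u := range units {
		if u.Name == name {
			return u, true
		}
	}
	return unit{}, false
}

func main() {
	units := []unit{
		{Name: "some-other.service", Contents: "[Unit]"}, // made-up name for illustration
		{Name: "kubelet.service", Contents: "[Service]"},
	}
	if u, ok := findUnit(units, "kubelet.service"); ok {
		fmt.Println("found", u.Name)
	} else {
		fmt.Println("kubelet.service not found")
	}
}
```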
Some commands have been here for so long, and are used so regularly, that they are considered GA. Other commands are no longer that useful.
Description of the problem:
Right after installation, hub cluster indicated two clusters:
Status of local-agent-cluster-cluster-deployment is Detached.
Also there is no information about Labels, Nodes and Add-ons.
How reproducible:
100%
Steps to reproduce:
1. Deploy OCP 4.14 x86_64
2. Open cluster management console
3. Open All clusters view
Actual results:
Status of local-agent-cluster-cluster-deployment is Detached.
Expected results:
Status of local-agent-cluster-cluster-deployment is Ready.
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/52
Description of problem:
Manifests will be removed from CCO image so we have to start using CCA(cluster-config-api) image for bootstrap
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
KAS bootstrap container fails
Expected results:
KAS bootstrap container succeeds
Additional info:
Description of problem:
Environment file /etc/kubernetes/node.env is overwritten after node restart. There is a typo in https://github.com/openshift/machine-config-operator/blob/master/templates/common/aws/files/usr-local-bin-aws-kubelet-nodename.yaml: the variable should be changed to NODEENV wherever NODENV is found.
Version-Release number of selected component (if applicable):
How reproducible:
Easy
Steps to Reproduce:
1. Change contents of /etc/kubernetes/node.env
2. Restart node
3. Notice changes are lost
Actual results:
Expected results:
/etc/kubernetes/node.env should not be changed after restart of a node
Additional info:
Please review the following PR: https://github.com/openshift/gcp-pd-csi-driver-operator/pull/85
Please review the following PR: https://github.com/openshift/cluster-kube-apiserver-operator/pull/1589
This is a clone of issue OCPBUGS-25989. The following is the description of the original issue:
—
Description of problem:
Since OCP 4.15 we see an issue where an OLM-deployed operator is unable to operate in multiple watched namespaces. It works fine with a single watched namespace (subscription). The same test also passes if we deploy the operator from files instead of using OLM. Based on the operator log, it seems to be a permission issue. The same test works fine on OCP 4.14 and older.
Version-Release number of selected component (if applicable):
Server Version: 4.15.0-ec.3
Kubernetes Version: v1.28.3+20a5764
How reproducible:
Always
Steps to Reproduce:
0. oc login OCP4.15
1. git clone https://gitlab.cee.redhat.com/amq-broker/claire
2. make -f Makefile.downstream build ARTEMIS_VERSION=7.11.4 RELEASE_TYPE=released
3. make -f Makefile.downstream operator_test OLM_IIB=registry-proxy.engineering.redhat.com/rh-osbs/iib:636350 OLM_CHANNEL=7.11.x TESTS=ClusteredOperatorSmokeTests TEST_LOG_LEVEL=debug DISABLE_RANDOM_NAMESPACES=true
Actual results:
Can't deploy artemis broker custom resource in given namespace (permission issue - see details below)
Expected results:
Successfully deployed broker on watched namespaces
Additional info:
Log from AMQ Broker operator - seems like some permission issues since 4.15
E0103 10:04:54.425202 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemis: failed to list *v1beta1.ActiveMQArtemis: activemqartemises.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemises" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425207 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1beta1.ActiveMQArtemisSecurity: failed to list *v1beta1.ActiveMQArtemisSecurity: activemqartemissecurities.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemissecurities" in API group "broker.amq.io" in the namespace "cluster-testsa"
E0103 10:04:54.425221 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: Failed to watch *v1.Pod: failed to list *v1.Pod: pods is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "pods" in API group "" in the namespace "cluster-testsa"
W0103 10:04:54.425296 1 reflector.go:324] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:250: failed to list *v1beta1.ActiveMQArtemisScaledown: activemqartemisscaledowns.broker.amq.io is forbidden: User "system:serviceaccount:cluster-tests:amq-broker-controller-manager" cannot list resource "activemqartemisscaledowns" in API group "broker.amq.io" in the namespace "cluster-testsa"
https://github.com/openshift/origin/pull/28360
Failing unit tests.
Every row has MasterNodesUpdated null, might have something to do with it. Fix would be in ci-tools.
Description of problem:
After installing an OpenShift IPI vSphere cluster, the coredns-monitor containers in the "openshift-vsphere-infra" namespace continuously report the message: "Failed to read ip from file /run/nodeip-configuration/ipv4" error="open /run/nodeip-configuration/ipv4: no such file or directory". The file "/run/nodeip-configuration/ipv4" present on the nodes is not actually mounted in the coredns pods. This does not appear to have any impact on the functionality of the cluster, but a "failed" message in the container logs can trigger alarms or prompt a search for problems in the cluster.
Version-Release number of selected component (if applicable):
Any 4.12, 4.13, 4.14
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift IPI vSphere cluster
2. Wait for the installation to complete
3. Read the logs of any coredns-monitor container in the "openshift-vsphere-infra" namespace
Actual results:
coredns-monitor continuously reports the failed message, misleading a cluster administrator into searching for a real issue.
Expected results:
coredns-monitor should not report this failed message if nothing needs to be fixed.
Additional info:
The same issue happens in Baremetal IPI clusters.
Description of problem:
Currently, vmware-vsphere-csi-driver-webhook exposes HTTP/2 endpoints:
$ oc -n openshift-cluster-csi-drivers exec deployment/vmware-vsphere-csi-driver-webhook -- curl -kv https://localhost:8443/readyz
...
* ALPN, server accepted to use h2
> GET /readyz HTTP/2
< HTTP/2 404
To err on the side of caution, we should discontinue the handling of HTTP/2 requests.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. oc -n openshift-cluster-csi-drivers exec deployment/vmware-vsphere-csi-driver-webhook -- curl -kv https://localhost:8443/readyz
Actual results:
HTTP/2 requests are accepted
Expected results:
HTTP/2 requests shouldn't be accepted by the webhook
Additional info:
Description of problem:
When I test PR https://github.com/openshift/machine-config-operator/pull/4083, there is a machineset that does not have any machine linked.
$ oc get machineset/rioliu-1220c-bz2gp-worker-f -n openshift-machine-api
NAME                          DESIRED   CURRENT   READY   AVAILABLE   AGE
rioliu-1220c-bz2gp-worker-f   0         0                             3h47m
Many errors like the one below are found in the MCD log:
I1220 09:15:59.743704 1 machine_set_boot_image_controller.go:211] Error syncing machineset openshift-machine-api/rioliu-1220c-bz2gp-worker-f: failed to fetch architecture type of machineset rioliu-1220c-bz2gp-worker-f, err: could not find any machines linked to machineset, error: %!w(<nil>)
The machineset patch is skipped in the reconcile loop due to the above error, so the boot image info cannot be patched when the machineset does not have any machine provisioned.
Version-Release number of selected component (if applicable):
How reproducible:
Consistently
Steps to Reproduce:
https://github.com/openshift/machine-config-operator/pull/4083#issuecomment-1864226629
Actual results:
The machineset is skipped in the reconcile loop due to the above error, so the boot image info cannot be patched.
Expected results:
The machineset should be updated even when no linked machine is found, because it may have been scaled down to 0 replicas.
Additional info:
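As a side note, the `%!w(<nil>)` suffix in the log above is what Go's fmt.Errorf prints when the `%w` verb is given a nil error. The sketch below shows that formatting behavior and one way to avoid it; the function names are hypothetical, and it does not address the separate question of how the controller should handle a machineset with zero machines.

```go
package main

import (
	"errors"
	"fmt"
)

// wrapNaive always uses %w, even when cause is nil.
func wrapNaive(machineSet string, cause error) error {
	return fmt.Errorf("could not find any machines linked to machineset %s, error: %w", machineSet, cause)
}

// wrapChecked only wraps when there is an underlying error to report.
func wrapChecked(machineSet string, cause error) error {
	if cause == nil {
		return fmt.Errorf("could not find any machines linked to machineset %s", machineSet)
	}
	return fmt.Errorf("could not find any machines linked to machineset %s: %w", machineSet, cause)
}

func main() {
	fmt.Println(wrapNaive("worker-f", nil))
	fmt.Println(wrapChecked("worker-f", nil))
	fmt.Println(wrapChecked("worker-f", errors.New("list failed")))
}
```

The first line printed ends in `%!w(<nil>)`, mirroring the log entry; the other two read cleanly.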
This is a clone of issue OCPBUGS-27473. The following is the description of the original issue:
—
BuildRun logs cannot be displayed in the console and shows the following error:
The buildrun is created and started using the shp cli (similar behavior is observed when the build is created & started via console/yaml too):
shp build create goapp-buildah \
  --strategy-name="buildah" \
  --source-url="https://github.com/shipwright-io/sample-go" \
  --source-context-dir="docker-build" \
  --output-image="image-registry.openshift-image-registry.svc:5000/demo/go-app"
The issue occurs on OCP 4.14.6. Investigation showed that this works correctly on OCP 4.14.5.
Description of problem:
Pipeline Name gets changed to "new-pipeline" on the Edit Pipeline YAML/Builder
Version-Release number of selected component (if applicable):
OpenShift 4.15
Pipelines Operator: 1.12.1
How reproducible:
Always when you are creating the tasks using YAML and then creating Pipeline with the tasks. (NOT OBSERVED WHEN USING THE PIPELINE BUILDER)
Steps to Reproduce:
1. Create Task 1: https://tekton.dev/docs/getting-started/tasks/#create-and-run-a-basic-task
2. Create Task 2: https://tekton.dev/docs/getting-started/pipelines/#create-and-run-a-second-task
3. Create Pipeline: https://tekton.dev/docs/getting-started/pipelines/#create-and-run-a-pipeline
4. Click "Edit Pipeline" from the Actions Menu
Actual results:
Pipeline Name gets changed to "new-pipeline" on the Edit Pipeline YAML/Builder, and cannot update the Pipeline.
Expected results:
The pipeline name should not change.
Additional info:
Video : https://drive.google.com/file/d/19-dI8lSdH6tAZm3T8CQHw78P2AzdSIRv/view?usp=sharing
Description of problem:
The default security settings for new Azure Storage accounts are being updated. Using ccoctl to create Azure Workload Identity resources in region eastus does not work. I tested several commonly used regions; the results are as follows.
List of regions not working properly: eastus
$ az storage account list -g mihuangtt0947-rg-oidc --query "[].[name,allowBlobPublicAccess]" -o tsv
mihuangtt0947rgoidc False
List of regions working properly: westus, australiacentral, australiaeast, centralus, australiasoutheast, southindia…
$ az storage account list -g mihuangdispri0929-rg-oidc --query "[].[name,allowBlobPublicAccess]" -o tsv
mihuangdispri0929rgoidc True
Version-Release number of selected component (if applicable):
4.14/4.15
How reproducible:
Always
Steps to Reproduce:
1. Run the ccoctl azure create-all command to create Azure workload identity resources in region eastus.
[huangmingxia@fedora CCO-bugs]$ ./ccoctl azure create-all --name 'mihuangp1' --region 'eastus' --subscription-id {SUBSCRIPTION-ID} --tenant-id {TENANNT-ID} --credentials-requests-dir=./credrequests --dnszone-resource-group-name 'os4-common' --storage-account-name='mihuangp1oidc' --output-dir test
Actual results:
[huangmingxia@fedora CCO-bugs]$ ./ccoctl azure create-all --name 'mihuangp1' --region 'eastus' --subscription-id {SUBSCRIPTION-ID} --tenant-id {TENANNT-ID} --credentials-requests-dir=./credrequests --dnszone-resource-group-name 'os4-common' --storage-account-name='mihuangp1oidc' --output-dir test
2023/10/25 11:14:36 Using existing RSA keypair found at test/serviceaccount-signer.private
2023/10/25 11:14:36 Copying signing key for use by installer
2023/10/25 11:14:36 No --oidc-resource-group-name provided, defaulting OIDC resource group name to mihuangp1-oidc
2023/10/25 11:14:36 No --installation-resource-group-name provided, defaulting installation resource group name to mihuangp1
2023/10/25 11:14:36 No --blob-container-name provided, defaulting blob container name to mihuangp1
2023/10/25 11:14:39 Created resource group /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/mihuangp1-oidc
2023/10/25 11:15:01 Created storage account /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/mihuangp1-oidc/providers/Microsoft.Storage/storageAccounts/mihuangp1oidc
2023/10/25 11:15:03 failed to create blob container: PUT https://management.azure.com/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/mihuangp1-oidc/providers/Microsoft.Storage/storageAccounts/mihuangp1oidc/blobServices/default/containers/mihuangp1
--------------------------------------------------------------------------------
RESPONSE 409: 409 Conflict
ERROR CODE: PublicAccessNotPermitted
--------------------------------------------------------------------------------
{
  "error": {
    "code": "PublicAccessNotPermitted",
    "message": "Public access is not permitted on this storage account.\nRequestId:415c51f1-c01e-0017-7ef1-06ec0c000000\nTime: 2023-10-25T03:15:02.7928767Z"
  }
}
--------------------------------------------------------------------------------
$ az storage account list -g mihuangtt0947-rg-oidc --query "[].[name,allowBlobPublicAccess]" -o tsv
mihuangtt0947rgoidc False
Expected results:
Resources created successfully.
$ az storage account list -g mihuangtt0947-rg-oidc --query "[].[name,allowBlobPublicAccess]" -o tsv
mihuangtt0947rgoidc True
Additional info:
Google email: Important notice: Default security settings for new Azure Storage accounts will be updated
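The region comparison in this report relies on reading the allowBlobPublicAccess column from the `az storage account list ... -o tsv` output. A minimal sketch of that check in Python is shown below; the function name is hypothetical, and the sample text simply reuses the two result lines quoted in the report (whitespace-separated name and flag), so treat it as an illustration rather than part of ccoctl.

```python
from typing import List


def accounts_blocking_public_access(tsv_output: str) -> List[str]:
    """Return storage account names whose allowBlobPublicAccess column is False.

    Assumes each line holds exactly two whitespace-separated fields
    (name, allowBlobPublicAccess), as in the output quoted in the report.
    """
    blocked = []
    for line in tsv_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, allow_public = fields
        if allow_public.lower() == "false":
            blocked.append(name)
    return blocked


# Sample lines taken from the report (eastus account first, then a working region).
sample = "mihuangtt0947rgoidc\tFalse\nmihuangdispri0929rgoidc\tTrue\n"
print(accounts_blocking_public_access(sample))
```

An account listed by this helper is one where creating a publicly readable blob container would be expected to fail with PublicAccessNotPermitted, as seen in the eastus run.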
Description of problem:
After enabling user-defined monitoring on a HyperShift hosted cluster, the PrometheusOperatorRejectedResources alert starts firing.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Start a HyperShift-hosted cluster with cluster-bot
2. Enable user-defined monitoring
3.
Actual results:
The PrometheusOperatorRejectedResources alert starts firing
Expected results:
No alert firing
Additional info:
Need to reach out to the HyperShift folks as the fix should probably be in their code base.
Description of problem:
If pipefail is active in a bash script, piping ( | ) the output of the ip command into another command can hide the actual error of the ip command when it fails with an exit code other than 1.
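The masking effect can be illustrated with a small Python model of how bash computes a pipeline's exit status. This is only a sketch: the report does not show the affected script, so the `ip ... | grep ...` shape used in the comments is an assumption, and `pipeline_status` is a hypothetical helper, not a bash API. The rule it encodes is the documented one: without pipefail the status is that of the last command; with pipefail it is that of the rightmost command that exited non-zero, or zero if all succeeded.

```python
from typing import Sequence


def pipeline_status(exit_codes: Sequence[int], pipefail: bool) -> int:
    """Model the exit status bash reports for `cmd1 | cmd2 | ...`.

    exit_codes lists the per-command statuses from left to right.
    """
    if not pipefail:
        # Default behaviour: only the last command counts.
        return exit_codes[-1]
    # pipefail: the rightmost non-zero status wins.
    for code in reversed(exit_codes):
        if code != 0:
            return code
    return 0


# Assumed scenario: `ip` fails with 2 and the downstream filter finds
# nothing and exits with 1. The pipeline reports 1, so the caller cannot
# tell "no match" apart from "ip itself failed".
print(pipeline_status([2, 1], pipefail=True))   # 1
# When the downstream command succeeds, pipefail does surface the 2.
print(pipeline_status([2, 0], pipefail=True))   # 2
```

In bash itself, the per-command statuses remain available in the PIPESTATUS array immediately after the pipeline, which is one way a script can inspect the status of the first command directly.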
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
This issue has the same root cause as https://issues.redhat.com/browse/OCPBUGS-9026, but I am opening a new bug because the issue has become critical now that ROSA uses NLB by default since 4.14. HCP (HyperShift) private clusters without infra nodes are the most seriously affected, because they have worker nodes only and no workaround is currently available. If the old bug can be used to track this issue, please close this one.
Version-Release number of selected component (if applicable):
4.14.1 HyperShift Private cluster
How reproducible:
100%
Steps to Reproduce:
1. Create a ROSA HCP (HyperShift) cluster
2. Run qe-e2e-test on this cluster, or curl the route from a pod inside the cluster
3.
Actual results:
1. co/console status is flapping since the route is intermittently accessible
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.1    True        False         4h56m   Error while reconciling 4.14.1: the cluster operator console is not available
2. Check the nodes; router pods are running on both worker nodes
$ oc get node
NAME                          STATUS   ROLES    AGE    VERSION
ip-10-0-49-184.ec2.internal   Ready    worker   5h5m   v1.27.6+f67aeb3
ip-10-0-63-210.ec2.internal   Ready    worker   5h8m   v1.27.6+f67aeb3
$ oc -n openshift-ingress get pod -owide
NAME                              READY   STATUS    RESTARTS   AGE    IP           NODE                          NOMINATED NODE   READINESS GATES
router-default-86d569bf84-bq66f   1/1     Running   0          5h8m   10.130.0.7   ip-10-0-49-184.ec2.internal   <none>           <none>
router-default-86d569bf84-v54hp   1/1     Running   0          5h8m   10.128.0.9   ip-10-0-63-210.ec2.internal   <none>           <none>
3. Check the ingresscontroller LB setting; it uses an Internal NLB
spec:
  endpointPublishingStrategy:
    loadBalancer:
      dnsManagementPolicy: Managed
      providerParameters:
        aws:
          networkLoadBalancer: {}
          type: NLB
        type: AWS
      scope: Internal
    type: LoadBalancerService
4. Continue to curl the route from a pod inside the cluster
$ oc rsh console-operator-86786df488-w6fks
Defaulted container "console-operator" out of: console-operator, conversion-webhook-server
sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
HTTP/1.1 200 OK
sh-4.4$ curl https://console-openshift-console.apps.rosa.ci-rosa-h-d53b.ptk5.p3.openshiftapps.com -k -I
Connection timed out
Expected results:
1. co/console should be stable, and curl to the console route should always succeed.
2. qe-e2e-test should not fail
Additional info:
qe-e2e-test on the cluster: https://qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gs/qe-private-deck/pr-logs/pull/openshift_release/45369/rehearse-45369-periodic-ci-openshift-openshift-tests-private-release-4.14-amd64-stable-aws-rosa-sts-hypershift-sec-guest-prod-private-link-full-f2/1724307074235502592
Description of problem:
We have a customer on OCP 4.10.47, using OVN-K8S in local gateway mode, who requires either updating the default route or adding an additional one. The question is whether there is a way to do this using the interface hints such that the new default route has a higher (better) priority than the day-0 default route and, on node reboot and/or cluster upgrade, this does not affect OVN (based on the interface hints, OVN can use the original default route even though it would have a lower priority).
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The pull-ci-openshift-ovn-kubernetes-master-e2e-aws-ovn-local-to-shared-gateway-mode-migration job started failing recently when the ovnkube-master daemonset would not finish rolling out after 360s.
The must-gather, which is taken a few minutes after the test failure, shows that the daemonset is still not ready, so I believe that increasing the timeout is not the answer.
Some debug info:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk get daemonsets -A
NAMESPACE                                NAME                              DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
openshift-cluster-csi-drivers            aws-ebs-csi-driver-node           6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-cluster-node-tuning-operator   tuned                             6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-dns                            dns-default                       6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-dns                            node-resolver                     6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-image-registry                 node-ca                           6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-ingress-canary                 ingress-canary                    3         3         3       3            3           kubernetes.io/os=linux                                                 8h
openshift-machine-api                    machine-api-termination-handler   0         0         0       0            0           kubernetes.io/os=linux,machine.openshift.io/interruptible-instance=   8h
openshift-machine-config-operator        machine-config-daemon             6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-machine-config-operator        machine-config-server             3         3         3       3            3           node-role.kubernetes.io/master=                                        8h
openshift-monitoring                     node-exporter                     6         6         6       6            6           kubernetes.io/os=linux                                                 8h
openshift-multus                         multus                            6         6         6       6            6           kubernetes.io/os=linux                                                 9h
openshift-multus                         multus-additional-cni-plugins     6         6         6       6            6           kubernetes.io/os=linux                                                 9h
openshift-multus                         network-metrics-daemon            6         6         6       6            6           kubernetes.io/os=linux                                                 9h
openshift-network-diagnostics            network-check-target              6         6         6       6            6           beta.kubernetes.io/os=linux                                            9h
openshift-ovn-kubernetes                 ovnkube-master                    3         3         2       2            2           beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=            9h
openshift-ovn-kubernetes                 ovnkube-node                      6         6         6       6            6           beta.kubernetes.io/os=linux                                            9h

Name:           ovnkube-master
Selector:       app=ovnkube-master
Node-Selector:  beta.kubernetes.io/os=linux,node-role.kubernetes.io/master=
Labels:         networkoperator.openshift.io/generates-operator-status=stand-alone
Annotations:    deprecated.daemonset.template.generation: 3
                kubernetes.io/description: This daemonset launches the ovn-kubernetes controller (master) networking components.
                networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
                networkoperator.openshift.io/hybrid-overlay-status: disabled
                networkoperator.openshift.io/ip-family-mode: single-stack
                release.openshift.io/version: 4.14.0-0.ci.test-2023-08-04-123014-ci-op-c6fp05f4-latest
Desired Number of Nodes Scheduled: 3
Current Number of Nodes Scheduled: 3
Number of Nodes Scheduled with Up-to-date Pods: 2
Number of Nodes Scheduled with Available Pods: 2
Number of Nodes Misscheduled: 0
Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:           app=ovnkube-master
                    component=network
                    kubernetes.io/os=linux
                    openshift.io/component=network
                    ovn-db-pod=true
                    type=infra
  Annotations:      networkoperator.openshift.io/cluster-network-cidr: 10.128.0.0/14
                    networkoperator.openshift.io/hybrid-overlay-status: disabled
                    networkoperator.openshift.io/ip-family-mode: single-stack
                    target.workload.openshift.io/management: {"effect": "PreferredDuringScheduling"}
  Service Account:  ovn-kubernetes-controller
It seems there is one pod that is not coming up all the way, and that pod has two containers that are not ready (sbdb and nbdb). Logs from those containers are below:
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk describe pod ovnkube-master-7qlm5 -n openshift-ovn-kubernetes | rg '^ [a-z].*:|Ready'
  northd:
    Ready: True
  nbdb:
    Ready: False
  kube-rbac-proxy:
    Ready: True
  sbdb:
    Ready: False
  ovnkube-master:
    Ready: True
  ovn-dbchecker:
    Ready: True
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c sbdb
2023-08-04T13:08:49.127480354Z + [[ -f /env/_master ]]
2023-08-04T13:08:49.127562165Z + trap quit TERM INT
2023-08-04T13:08:49.127609496Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:49.127637926Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.127637926Z + transport=ssl
2023-08-04T13:08:49.127645167Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:49.127682687Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:49.127690638Z + db=sb
2023-08-04T13:08:49.127690638Z + db_port=9642
2023-08-04T13:08:49.127712038Z + ovn_db_file=/etc/ovn/ovnsb_db.db
2023-08-04T13:08:49.127854181Z + [[ ! ssl:10.0.102.2:9642,ssl:10.0.42.108:9642,ssl:10.0.74.128:9642 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:49.128199437Z ++ bracketify 10.0.42.108
2023-08-04T13:08:49.128237768Z ++ case "$1" in
2023-08-04T13:08:49.128265838Z ++ echo 10.0.42.108
2023-08-04T13:08:49.128493242Z + OVN_ARGS='--db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:49.128535253Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.128819438Z ++ date -Iseconds
2023-08-04T13:08:49.130157063Z 2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:49.130170893Z + echo '2023-08-04T13:08:49+00:00 - starting sbdb CLUSTER_INITIATOR_IP=10.0.102.2'
2023-08-04T13:08:49.130170893Z + initialize=false
2023-08-04T13:08:49.130179713Z + [[ ! -e /etc/ovn/ovnsb_db.db ]]
2023-08-04T13:08:49.130318475Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:49.130406657Z + wait 9
2023-08-04T13:08:49.130493659Z + exec /usr/share/ovn/scripts/ovn-ctl -db-sb-cluster-local-port=9644 --db-sb-cluster-local-addr=10.0.42.108 --no-monitor --db-sb-cluster-local-proto=ssl --ovn-sb-db-ssl-key=/ovn-cert/tls.key --ovn-sb-db-ssl-cert=/ovn-cert/tls.crt --ovn-sb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-sb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_sb_ovsdb
2023-08-04T13:08:49.208399304Z 2023-08-04T13:08:49.208Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-sb.log
2023-08-04T13:08:49.213507987Z ovn-sbctl: unix:/var/run/ovn/ovnsb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:49.224890005Z 2023-08-04T13:08:49Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:49.224912156Z 2023-08-04T13:08:49Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:49.255474964Z 2023-08-04T13:08:49.255Z|00002|raft|INFO|local server ID is 7f92
2023-08-04T13:08:49.333342909Z 2023-08-04T13:08:49.333Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:49.348948944Z 2023-08-04T13:08:49.348Z|00004|reconnect|INFO|ssl:10.0.102.2:9644: connecting...
2023-08-04T13:08:49.349002565Z 2023-08-04T13:08:49.348Z|00005|reconnect|INFO|ssl:10.0.74.128:9644: connecting...
2023-08-04T13:08:49.352510569Z 2023-08-04T13:08:49.352Z|00006|reconnect|INFO|ssl:10.0.102.2:9644: connected
2023-08-04T13:08:49.353870484Z 2023-08-04T13:08:49.353Z|00007|reconnect|INFO|ssl:10.0.74.128:9644: connected
2023-08-04T13:08:49.889326777Z 2023-08-04T13:08:49.889Z|00008|raft|INFO|server 2501 is leader for term 5
2023-08-04T13:08:49.890316765Z 2023-08-04T13:08:49.890Z|00009|raft|INFO|rejecting append_request because previous entry 5,1538 not in local log (mismatch past end of log)
2023-08-04T13:08:49.891199951Z 2023-08-04T13:08:49.891Z|00010|raft|INFO|rejecting append_request because previous entry 5,1539 not in local log (mismatch past end of log)
2023-08-04T13:08:50.225632838Z 2023-08-04T13:08:50Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connecting...
2023-08-04T13:08:50.225677739Z 2023-08-04T13:08:50Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnsb_db.sock: connected
2023-08-04T13:08:50.227772827Z Waiting for OVN_Southbound to come up.
2023-08-04T13:08:55.716284614Z 2023-08-04T13:08:55.716Z|00011|raft|INFO|ssl:10.0.74.128:43498: learned server ID 3dff
2023-08-04T13:08:55.716323395Z 2023-08-04T13:08:55.716Z|00012|raft|INFO|ssl:10.0.74.128:43498: learned remote address ssl:10.0.74.128:9644
2023-08-04T13:08:55.724570375Z 2023-08-04T13:08:55.724Z|00013|raft|INFO|ssl:10.0.102.2:47804: learned server ID 2501
2023-08-04T13:08:55.724599466Z 2023-08-04T13:08:55.724Z|00014|raft|INFO|ssl:10.0.102.2:47804: learned remote address ssl:10.0.102.2:9644
2023-08-04T13:08:59.348572779Z 2023-08-04T13:08:59.348Z|00015|memory|INFO|32296 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:59.348648190Z 2023-08-04T13:08:59.348Z|00016|memory|INFO|atoms:35959 cells:31476 monitors:0 n-weak-refs:749 raft-connections:4 raft-log:1543 txn-history:100 txn-history-atoms:7100
➜ static-kas git:(master) oc --kubeconfig=/tmp/kk logs ovnkube-master-7qlm5 -n openshift-ovn-kubernetes -c nbdb
2023-08-04T13:08:48.779743434Z + [[ -f /env/_master ]]
2023-08-04T13:08:48.779743434Z + trap quit TERM INT
2023-08-04T13:08:48.779825516Z + ovn_kubernetes_namespace=openshift-ovn-kubernetes
2023-08-04T13:08:48.779825516Z + ovndb_ctl_ssl_opts='-p /ovn-cert/tls.key -c /ovn-cert/tls.crt -C /ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.779825516Z + transport=ssl
2023-08-04T13:08:48.779825516Z + ovn_raft_conn_ip_url_suffix=
2023-08-04T13:08:48.779825516Z + [[ 10.0.42.108 == \: ]]
2023-08-04T13:08:48.779825516Z + db=nb
2023-08-04T13:08:48.779825516Z + db_port=9641
2023-08-04T13:08:48.779825516Z + ovn_db_file=/etc/ovn/ovnnb_db.db
2023-08-04T13:08:48.779887606Z + [[ ! ssl:10.0.102.2:9641,ssl:10.0.42.108:9641,ssl:10.0.74.128:9641 =~ .:10\.0\.42\.108:. ]]
2023-08-04T13:08:48.780159182Z ++ bracketify 10.0.42.108
2023-08-04T13:08:48.780167142Z ++ case "$1" in
2023-08-04T13:08:48.780172102Z ++ echo 10.0.42.108
2023-08-04T13:08:48.780314224Z + OVN_ARGS='--db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt'
2023-08-04T13:08:48.780314224Z + CLUSTER_INITIATOR_IP=10.0.102.2
2023-08-04T13:08:48.780518588Z ++ date -Iseconds
2023-08-04T13:08:48.781738820Z 2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108
2023-08-04T13:08:48.781753021Z + echo '2023-08-04T13:08:48+00:00 - starting nbdb CLUSTER_INITIATOR_IP=10.0.102.2, K8S_NODE_IP=10.0.42.108'
2023-08-04T13:08:48.781753021Z + initialize=false
2023-08-04T13:08:48.781753021Z + [[ ! -e /etc/ovn/ovnnb_db.db ]]
2023-08-04T13:08:48.781816342Z + [[ false == \t\r\u\e ]]
2023-08-04T13:08:48.781936684Z + wait 9
2023-08-04T13:08:48.781974715Z + exec /usr/share/ovn/scripts/ovn-ctl -db-nb-cluster-local-port=9643 --db-nb-cluster-local-addr=10.0.42.108 --no-monitor --db-nb-cluster-local-proto=ssl --ovn-nb-db-ssl-key=/ovn-cert/tls.key --ovn-nb-db-ssl-cert=/ovn-cert/tls.crt --ovn-nb-db-ssl-ca-cert=/ovn-ca/ca-bundle.crt '-ovn-nb-log=-vconsole:info -vfile:off -vPATTERN:console:%D {%Y-%m-%dT%H:%M:%S.###Z} |%05N|%c%T|%p|%m' run_nb_ovsdb
2023-08-04T13:08:48.851644059Z 2023-08-04T13:08:48.851Z|00001|vlog|INFO|opened log file /var/log/ovn/ovsdb-server-nb.log
2023-08-04T13:08:48.852091247Z ovn-nbctl: unix:/var/run/ovn/ovnnb_db.sock: database connection failed (No such file or directory)
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00001|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:48.861365357Z 2023-08-04T13:08:48Z|00002|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connection attempt failed (No such file or directory)
2023-08-04T13:08:48.875126148Z 2023-08-04T13:08:48.875Z|00002|raft|INFO|local server ID is c503
2023-08-04T13:08:48.911846610Z 2023-08-04T13:08:48.911Z|00003|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 3.1.2
2023-08-04T13:08:48.918864408Z 2023-08-04T13:08:48.918Z|00004|reconnect|INFO|ssl:10.0.102.2:9643: connecting...
2023-08-04T13:08:48.918934490Z 2023-08-04T13:08:48.918Z|00005|reconnect|INFO|ssl:10.0.74.128:9643: connecting...
2023-08-04T13:08:48.923439162Z 2023-08-04T13:08:48.923Z|00006|reconnect|INFO|ssl:10.0.102.2:9643: connected
2023-08-04T13:08:48.925166154Z 2023-08-04T13:08:48.925Z|00007|reconnect|INFO|ssl:10.0.74.128:9643: connected
2023-08-04T13:08:49.861650961Z 2023-08-04T13:08:49Z|00003|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connecting...
2023-08-04T13:08:49.861747153Z 2023-08-04T13:08:49Z|00004|reconnect|INFO|unix:/var/run/ovn/ovnnb_db.sock: connected
2023-08-04T13:08:49.875272530Z 2023-08-04T13:08:49.875Z|00008|raft|INFO|server fccb is leader for term 6
2023-08-04T13:08:49.875302480Z 2023-08-04T13:08:49.875Z|00009|raft|INFO|rejecting append_request because previous entry 6,1732 not in local log (mismatch past end of log)
2023-08-04T13:08:49.876027164Z Waiting for OVN_Northbound to come up.
2023-08-04T13:08:55.694760761Z 2023-08-04T13:08:55.694Z|00010|raft|INFO|ssl:10.0.74.128:57122: learned server ID d382
2023-08-04T13:08:55.694800872Z 2023-08-04T13:08:55.694Z|00011|raft|INFO|ssl:10.0.74.128:57122: learned remote address ssl:10.0.74.128:9643
2023-08-04T13:08:55.706904913Z 2023-08-04T13:08:55.706Z|00012|raft|INFO|ssl:10.0.102.2:43230: learned server ID fccb
2023-08-04T13:08:55.706931733Z 2023-08-04T13:08:55.706Z|00013|raft|INFO|ssl:10.0.102.2:43230: learned remote address ssl:10.0.102.2:9643
2023-08-04T13:08:58.919567770Z 2023-08-04T13:08:58.919Z|00014|memory|INFO|21944 kB peak resident set size after 10.1 seconds
2023-08-04T13:08:58.919643762Z 2023-08-04T13:08:58.919Z|00015|memory|INFO|atoms:8471 cells:7481 monitors:0 n-weak-refs:200 raft-connections:4 raft-log:1737 txn-history:72 txn-history-atoms:8165
➜ static-kas git:(master)
This seems to happen very frequently now, but was not happening before around July 21st.
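The "not finished rolling out" condition in this report can be read directly from the daemonset counters quoted above (3 desired, 2 up to date, 2 available). A minimal Python sketch of that check follows. The dictionary keys are the Kubernetes DaemonSetStatus field names; the function name and the exact completeness rule are assumptions for illustration and are not taken from the CI job's code.

```python
def rollout_complete(status: dict) -> bool:
    """Return True when every scheduled node runs an updated, available pod.

    `status` is expected to look like the `.status` block of a DaemonSet.
    Missing counters are treated as 0, which is how the API omits them.
    """
    desired = status.get("desiredNumberScheduled", 0)
    updated = status.get("updatedNumberScheduled", 0)
    available = status.get("numberAvailable", 0)
    return updated == desired and available == desired


# Counters from the ovnkube-master describe output in this report.
ovnkube_master = {
    "desiredNumberScheduled": 3,
    "updatedNumberScheduled": 2,
    "numberAvailable": 2,
}
print(rollout_complete(ovnkube_master))  # False: one master pod is still not ready
```

Because the counters stay at 2 of 3 even in the later must-gather, a longer timeout on this check would only delay the failure, which matches the reporter's conclusion.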
Description of problem:
While installing many SNOs via ZTP using ACM, two SNOs failed to complete install because the image-registry was degraded during the install process.
# cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion"
vm01831
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
vm02740
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             False       False         18h     Error while reconciling 4.14.0-rc.0: the cluster operator image-registry is degraded
# cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get co image-registry"
vm01831
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
vm02740
NAME             VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
image-registry   4.14.0-rc.0   True        False         True       18h     Degraded: The registry is removed...
Both showed the image-pruner job pod in error state:
# cat clusters | xargs -I % sh -c "echo '%'; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get po -n openshift-image-registry"
vm01831
NAME                                               READY   STATUS    RESTARTS      AGE
cluster-image-registry-operator-5d497944d4-czn64   1/1     Running   0             18h
image-pruner-28242720-w6jmv                        0/1     Error     0             18h
node-ca-vtfj8                                      1/1     Running   0             18h
vm02740
NAME                                               READY   STATUS    RESTARTS      AGE
cluster-image-registry-operator-5d497944d4-lbtqw   1/1     Running   1 (18h ago)   18h
image-pruner-28242720-ltqzk                        0/1     Error     0             18h
node-ca-4fntj                                      1/1     Running   0             18h
Version-Release number of selected component (if applicable):
Deployed SNO OCP - 4.14.0-rc.0
Hub 4.13.11
ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
Rare, only 2 clusters were found in this state after the test
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Seems like some permissions might have been lacking:
# oc --kubeconfig /root/hv-vm/kc/vm01831/kubeconfig logs -n openshift-image-registry image-pruner-28242720-w6jmv
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
attempt #1 has failed (exit code 1), going to make another attempt...
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
attempt #2 has failed (exit code 1), going to make another attempt...
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
attempt #3 has failed (exit code 1), going to make another attempt...
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
attempt #4 has failed (exit code 1), going to make another attempt...
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
attempt #5 has failed (exit code 1), going to make another attempt...
Error from server (Forbidden): pods is forbidden: User "system:serviceaccount:openshift-image-registry:pruner" cannot list resource "pods" in API group "" at the cluster scope: RBAC: clusterrole.rbac.authorization.k8s.io "system:image-pruner" not found
Description of problem:
When building images, items such as the /run/secrets/redhat.repo file from the build container are bind-mounted into the rootfs of the image being built for the benefit of RUN instructions. For a privileged build, the fact that the bind includes the nodev/noexec/nosuid flags doesn't cause any problems. When attempting the build without privileges, where the source file (itself mounted into the build container from the host) is not owned by the user the builder container is running as, this can fail because the kernel won't allow a bind mount that tries to remove any of these flags, and the logic which handled transient mounts when using chroot isolation wasn't taking enough care to avoid that possibility.
Version-Release number of selected component (if applicable):
buildah-1.32.0 and earlier
How reproducible:
Always
Steps to Reproduce:
1. On a single-node setup, `touch` /etc/yum.repos.d/redhat.repo, which is the target of a symbolic link in /usr/share/rhel/secrets, which /usr/share/containers/mounts.conf tells CRI-O should have its contents exposed in containers.
2. Attempt to build this spec:
apiVersion: build.openshift.io/v1
kind: Build
metadata:
  name: unprivileged
spec:
  source:
    type: Dockerfile
    dockerfile: |
      FROM registry.fedoraproject.org/fedora-minimal
      RUN find /run/secrets -ls
      RUN head /proc/self/uid_map /proc/self/gid_map /run/secrets/redhat.repo
  strategy:
    type: Docker
    dockerStrategy:
      env:
      - name: BUILD_PRIVILEGED
        value: "false"
3.
Actual results:
error running subprocess: remounting "/tmp/buildahXXX/mnt/rootfs/run/secrets/redhat.repo" in mount namespace with expected flags: operation not permitted
Expected results:
No such mount error. Depending on the permissions on the file, the unprivileged build may still fail if it attempts to use the contents of that file, but that's not a bug in the builder so much as a consequence of access controls.
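The constraint described in this report, that an unprivileged bind remount must not drop nodev/noexec/nosuid when the source mount already carries them, can be sketched as a small flag computation. This is an illustration only and not buildah's actual code: `remount_flags` is a hypothetical helper, and the constants are the Linux mount flag values from `<linux/mount.h>`.

```python
# Linux mount flag values (from <linux/mount.h>).
MS_NOSUID = 0x2
MS_NODEV = 0x4
MS_NOEXEC = 0x8
MS_REMOUNT = 0x20
MS_BIND = 0x1000

# Flags the report says the kernel will not let an unprivileged bind
# remount remove once the source mount has them.
LOCKABLE = MS_NOSUID | MS_NODEV | MS_NOEXEC


def remount_flags(requested: int, source_flags: int) -> int:
    """Return the flags to use for the remount.

    Any nodev/noexec/nosuid flag already set on the source mount is
    carried over, so the remount never tries to clear it.
    """
    return requested | (source_flags & LOCKABLE)


# The redhat.repo source from the report is mounted nodev,noexec,nosuid.
source = MS_NODEV | MS_NOEXEC | MS_NOSUID
flags = remount_flags(MS_BIND | MS_REMOUNT, source)
print(hex(flags))
```

With the source flags carried over, the remount request no longer asks to remove a locked flag, which is the condition the "operation not permitted" error in the actual results points to.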
Additional info:
Unknown machine config nodes are listed whose names do not belong to any node in the current cluster. My cluster has 6 nodes, but I can see 10 machine config nodes.
// current node
$ oc get node
NAME                                        STATUS   ROLES                  AGE     VERSION
ip-10-0-12-209.us-east-2.compute.internal   Ready    worker                 3h48m   v1.28.3+59b90bd
ip-10-0-23-177.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd
ip-10-0-32-216.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd
ip-10-0-42-207.us-east-2.compute.internal   Ready    worker                 53m     v1.28.3+59b90bd
ip-10-0-71-71.us-east-2.compute.internal    Ready    worker                 3h46m   v1.28.3+59b90bd
ip-10-0-81-190.us-east-2.compute.internal   Ready    control-plane,master   3h54m   v1.28.3+59b90bd
// current mcn
$ oc get machineconfignode
NAME                                        UPDATED   UPDATEPREPARED   UPDATEEXECUTED   UPDATEPOSTACTIONCOMPLETE   UPDATECOMPLETE   RESUMED
ip-10-0-12-209.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-23-177.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-32-216.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-42-207.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-53-5.us-east-2.compute.internal     True      False            False            False                      False            False
ip-10-0-56-84.us-east-2.compute.internal    True      False            False            False                      False            False
ip-10-0-58-210.us-east-2.compute.internal   True      False            False            False                      False            False
ip-10-0-58-99.us-east-2.compute.internal    False     True             True             Unknown                    False            False
ip-10-0-71-71.us-east-2.compute.internal    True      False            False            False                      False            False
ip-10-0-81-190.us-east-2.compute.internal   True      False            False            False                      False            False
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-04-162702
How reproducible:
Consistently
Steps to Reproduce:
1. Set up a cluster with 4.15.0-0.nightly-2023-12-04-162702 on AWS
2. Enable featureSet: TechPreviewNoUpgrade
3. Apply a file-based MC a few times
4. Check the node list
5. Check the machine config node list
Actual results:
Some unknown machine config nodes are found.
Expected results:
The number of machine config nodes should be the same as the number of cluster nodes.
Additional info:
must-gather: https://drive.google.com/file/d/1-VTismwXXZ9sYMHi8hDL7vhwzjuMn92n/view?usp=drive_link
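The mismatch in this report is a plain set difference between the machine config node names and the node names. A minimal Python sketch is shown below; the helper name is hypothetical, and the two lists are copied from the `oc get node` and `oc get machineconfignode` output quoted in the report.

```python
from typing import Iterable, List


def stale_machine_config_nodes(nodes: Iterable[str], mcns: Iterable[str]) -> List[str]:
    """Return machine config node names that have no matching cluster node."""
    return sorted(set(mcns) - set(nodes))


SUFFIX = ".us-east-2.compute.internal"

# Names from `oc get node` in the report (6 nodes).
nodes = [f"ip-10-0-{n}{SUFFIX}" for n in
         ["12-209", "23-177", "32-216", "42-207", "71-71", "81-190"]]

# Names from `oc get machineconfignode` in the report (10 entries).
mcns = [f"ip-10-0-{n}{SUFFIX}" for n in
        ["12-209", "23-177", "32-216", "42-207", "53-5",
         "56-84", "58-210", "58-99", "71-71", "81-190"]]

for name in stale_machine_config_nodes(nodes, mcns):
    print(name)
```

For the data in the report this leaves four names (ip-10-0-53-5, ip-10-0-56-84, ip-10-0-58-210, and ip-10-0-58-99), which accounts for the difference between 10 machine config nodes and 6 cluster nodes.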
Description of problem:
We want to understand our users, but the first page the user opens wasn't tracked.
Version-Release number of selected component (if applicable):
Seen on Dev Sandbox with 4.10 and 4.11 with telemetry enabled
How reproducible:
Sometimes! Looks like a race condition and requires active telemetry
Steps to Reproduce:
1. Open the browser network inspector and filter for segment
2. Open the developer console
Actual results:
One or two identity events are sent, but no page event
Expected results:
At least one identity event and at least one page event should be sent to segment
Additional info:
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/398
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/alibaba-cloud-csi-driver/pull/33
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25771. The following is the description of the original issue:
—
Description of problem:
On the OperatorHub page, a long catalogsource display name overflows the operator item tile.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-19-033450
How reproducible:
Always
Steps to Reproduce:
1. Create a catalogsource with a long display name.
2. Check operator items supplied by the created catalogsource on OperatorHub page
3.
Actual results:
2. The catalogsource display name overflows from the item tile
Expected results:
2. Show the catalogsource display name in the item tile dynamically, without overflow.
Additional info:
screenshot: https://drive.google.com/file/d/1GOHJOxoBmtZX3QWDsIvc2RT5a2inkpzM/view?usp=sharing
Please review the following PR: https://github.com/openshift/images/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/ironic-rhcos-downloader/pull/93
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-5113. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-25055. The following is the description of the original issue:
—
Description of problem:
No failure details are reported when signature verification of the target release payload fails during an upgrade. It is unclear to the user which action to take, for example checking whether a wrong configmap is set, whether the default store is unavailable, or whether there is an issue with a custom store.
# ./oc adm upgrade
Cluster version is 4.15.0-0.nightly-2023-12-08-202155
Upgradeable=False
  Reason: FeatureGates_RestrictedFeatureGates_TechPreviewNoUpgrade
  Message: Cluster operator config-operator should not be upgraded between minor versions: FeatureGatesUpgradeable: "TechPreviewNoUpgrade" does not allow updates
ReleaseAccepted=False
  Reason: RetrievePayload
  Message: Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
Upstream: https://amd64.ocp.releases.ci.openshift.org/graph
Channel: stable-4.15
Recommended updates:
  VERSION IMAGE
  4.15.0-0.nightly-2023-12-09-012410 registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7
# ./oc -n openshift-cluster-version logs cluster-version-operator-6b7b5ff598-vxjrq|grep "verified"|tail -n4
I1211 09:28:22.755834 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:22.755974 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817102 1 sync_worker.go:434] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
I1211 09:28:37.817488 1 event.go:298] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="4.15.0-0.nightly-2023-12-09-012410" image="registry.ci.openshift.org/ocp/release@sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7" failure=The update cannot be verified: unable to verify sha256:0bc9978f420a152a171429086853e80f033e012e694f9a762eee777f5a7fb4f7 against keyrings: verifier-public-key-redhat
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-08-202155
How reproducible:
always
Steps to Reproduce:
1. Trigger a fresh installation with TechPreview enabled (no spec.signaturestores property is set by default).
2. Trigger an upgrade against a nightly build (no signature is available in the default signature store).
Actual results:
No detailed log on the signature verification failure.
Expected results:
Include the detailed signature verification failure in the CVO log.
Additional info:
https://github.com/openshift/cluster-version-operator/pull/1003
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1882
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Issue 53 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Topology > Pod rings are missing for Deployments
Screenshot: https://drive.google.com/file/d/1RXCMKjvu2mdO2tQeHe-p5mLbINfmP5u4/view?usp=drive_link
Description of problem:
We need to update Console dynamic plugin build infra (@openshift-console/dynamic-plugin-sdk-webpack) to support sharing of PatternFly 5 dynamic modules between dynamic plugins, as per CONSOLE-3853.
This change is necessary for optimal performance of Console plugins that wish to migrate to PatternFly 5.
Description of problem:
On February 27th, the endpoints that were being queried for account details were turned off. The check is not vital, so we are fine with removing it; however, it is currently blocking all Power VS installs.
Version-Release number of selected component (if applicable):
4.13.0 - 4.16.0
How reproducible:
Easily
Steps to Reproduce:
1. Try to deploy with Power VS
2. Fail at the platform credentials check
Actual results:
Check fails
Expected results:
Check should succeed
Additional info:
elastic APM seems to be unused
The dockerfile provided is not configured properly and does not load the files generated by a build command.
After Patternfly5 Update: Knative Service Name Bar not visible in Topology view
Refer to this:
https://drive.google.com/file/d/1_KAotzs4WC8g2oW0OymTA_cGm-xabXlq/view?usp=sharing
After updating the cluster to 4.12.42 (from 4.12.15), the customer noticed that scheduled pods had problems starting on the node.
The initial thought was a multus issue, and then we realised that the script /usr/local/bin/configure-ovs.sh was modified and reverting the modification fixed the issue.
Modification:
> if nmcli connection show "$vlan_parent" &> /dev/null; then
>   # if the VLAN connection is configured with a connection UUID as parent, we need to find the underlying device
>   # and create the bridge against it, as the parent connection can be replaced by another bridge.
>   vlan_parent=$(nmcli --get-values GENERAL.DEVICES conn show ${vlan_parent})
> fi
Reference:
4.12.42
Should be reproducible by setting up inactive nmcli connections with the same names as the active ones
Not tested, but this should be something like
1. create inactive same nmcli connections
2. run the script
Script failing
Script should manage the connection using the UUID instead of using the Name.
Or maybe it's an underlying issue with how nmcli manages the relationship between objects.
The issue may be related to the way nmcli works: it should use the UUID to match `vlan.parent`, as it does with `connection.master`.
After creating a 4.14 ARO cluster, some cluster operators are not available because the load balancer can't be created.
It is because of the change of the default value of vmType in cloud-provider-azure.
https://github.com/kubernetes-sigs/cloud-provider-azure/pull/4214
In ARO, we use the standard vmType and don't use any VMSS as cluster nodes, but the installer doesn't specify vmType, which causes a vmType mismatch, so cloud-provider-azure can't configure the load balancer.
We would like vmType to default to `standard`, or to have an option to change it via the install config or similar.
discussion thread: https://redhat-internal.slack.com/archives/C68TNFWA2/p1700814868246649
Reproducible steps:
Create a 4.14 ARO cluster. Creating a normal cluster with standard VMs in Azure might also reproduce the issue.
What I got:
❯ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.14.1 False True True 21m OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.atokubi.eastus.osadev.cloud/healthz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)...
cloud-controller-manager 4.14.1 True False False 24m
cloud-credential 4.14.1 True False False 26m
cluster-autoscaler 4.14.1 True False False 20m
config-operator 4.14.1 True False False 21m
console 4.14.1 False True False 13m DeploymentAvailable: 0 replicas available for console deployment...
control-plane-machine-set 4.14.1 True False False 14m
csi-snapshot-controller 4.14.1 True False False 20m
dns 4.14.1 True False False 20m
etcd 4.14.1 True False False 19m
image-registry 4.14.1 True False False 8m11s
ingress False True True 7m36s The "default" ingress controller reports Available=False: IngressControllerUnavailable: One or more status conditions indicate unavailable: LoadBalancerReady=False (SyncLoadBalancerFailed: The service-controller component is reporting SyncLoadBalancerFailed events like: Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0...
insights 4.14.1 True False False 14m
kube-apiserver 4.14.1 True True False 10m NodeInstallerProgressing: 1 nodes are at revision 5; 2 nodes are at revision 6
kube-controller-manager 4.14.1 True False False 18m
kube-scheduler 4.14.1 True False False 17m
kube-storage-version-migrator 4.14.1 True False False 21m
machine-api 4.14.1 True False False 11m
machine-approver 4.14.1 True False False 20m
machine-config 4.14.1 True False False 15m
marketplace 4.14.1 True False False 20m
monitoring 4.14.1 True False False 6m53s
network 4.14.1 True False False 22m
node-tuning 4.14.1 True False False 20m
openshift-apiserver 4.14.1 True False False 14m
openshift-controller-manager 4.14.1 True False False 20m
openshift-samples 4.14.1 True False False 14m
operator-lifecycle-manager 4.14.1 True False False 20m
operator-lifecycle-manager-catalog 4.14.1 True False False 20m
operator-lifecycle-manager-packageserver 4.14.1 True False False 14m
service-ca 4.14.1 True False False 21m
storage 4.14.1 True False False 20m
❯ oc get svc -A | grep LoadBalancer
openshift-ingress router-default LoadBalancer 172.30.43.24 <pending> 80:32538/TCP,443:31115/TCP 38m
❯ oc get cm cloud-provider-config -n openshift-config -oyaml
apiVersion: v1
data:
  config: '{"cloud":"AzurePublicCloud","tenantId":"<reducted>","aadClientId":"","aadClientSecret":"","aadClientCertPath":"","aadClientCertPassword":"","useManagedIdentityExtension":false,"userAssignedIdentityID":"","subscriptionId":"<reducted>","resourceGroup":"aro-atokubi","location":"eastus","vnetName":"dev-vnet","vnetResourceGroup":"v4-eastus","subnetName":"atokubi-worker","securityGroupName":"atokubi-vnkt5-nsg","routeTableName":"atokubi-vnkt5-node-routetable","primaryAvailabilitySetName":"","vmType":"","primaryScaleSetName":"","cloudProviderBackoff":true,"cloudProviderBackoffRetries":0,"cloudProviderBackoffExponent":0,"cloudProviderBackoffDuration":6,"cloudProviderBackoffJitter":0,"cloudProviderRateLimit":false,"cloudProviderRateLimitQPS":0,"cloudProviderRateLimitBucket":0,"cloudProviderRateLimitQPSWrite":0,"cloudProviderRateLimitBucketWrite":0,"useInstanceMetadata":true,"loadBalancerSku":"standard","excludeMasterFromStandardLB":false,"disableOutboundSNAT":true,"maximumLoadBalancerRuleCount":0}'
kind: ConfigMap
metadata:
  creationTimestamp: "2023-11-29T10:08:19Z"
  name: cloud-provider-config
  namespace: openshift-config
  resourceVersion: "33363"
  uid: 8b35cf3f-65ee-428d-92e6-304165301e96
❯ oc logs azure-cloud-controller-manager-fbdfbdb86-hk646 -n openshift-cloud-controller-manager
Defaulted container "cloud-controller-manager" out of: cloud-controller-manager, azure-inject-credentials (init)
<omitted>
I1129 10:46:47.401672 1 controller.go:388] Ensuring load balancer for service openshift-ingress/router-default
I1129 10:46:47.401732 1 azure_loadbalancer.go:122] reconcileService: Start reconciling Service "openshift-ingress/router-default" with its resource basename "ac376ce0f66164eebb9fc0fa76a9c697"
I1129 10:46:47.401742 1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(openshift-ingress/router-default) - wantLb(true): started
I1129 10:46:47.401849 1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1129 10:46:47.505374 1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-atokubi) success
I1129 10:46:47.573290 1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(openshift-ingress/router-default): lb(aro-atokubi/atokubi-vnkt5) wantLb(true) resolved load balancer name
I1129 10:46:47.643053 1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1129 10:46:47.716774 1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-atokubi/providers/Microsoft.Network/networkInterfaces/atokubi-vnkt5-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
E1129 10:46:47.716802 1 azure_loadbalancer.go:126] reconcileLoadBalancer(openshift-ingress/router-default) failed: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
I1129 10:46:47.716835 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.315082823 request="services_ensure_loadbalancer" resource_group="aro-atokubi" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="openshift-ingress/router-default" result_code="failed_ensure_loadbalancer"
E1129 10:46:47.716866 1 controller.go:291] error processing service openshift-ingress/router-default (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0
I1129 10:46:47.716964 1 event.go:307] "Event occurred" object="openshift-ingress/router-default" fieldPath="" kind="Service" apiVersion="v1" type="Warning" reason="SyncLoadBalancerFailed" message="Error syncing load balancer: failed to ensure load balancer: failed to map VM Name to NodeName: VM Name atokubi-vnkt5-master-0"
After changing vmType from empty to "standard" in cloud-provider-config, the load balancer can be configured and the errors are gone.
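The workaround can be sketched as a small transformation of the JSON stored under the `config` key of the ConfigMap shown earlier. This is only an illustration under that assumption: the helper name `ensure_vm_type` is hypothetical and is not part of the installer or of cloud-provider-azure.

```python
import json


def ensure_vm_type(config_json: str, default: str = "standard") -> str:
    """Return the config JSON with an empty or missing vmType set to `default`."""
    config = json.loads(config_json)
    # An explicitly configured vmType (for example "vmss") is left untouched.
    if not config.get("vmType"):
        config["vmType"] = default
    return json.dumps(config)


# Trimmed-down sample of the config from the issue; most keys are omitted.
sample = '{"cloud":"AzurePublicCloud","vmType":"","loadBalancerSku":"standard"}'
patched = json.loads(ensure_vm_type(sample))
print(patched["vmType"])
```

Only the empty case is rewritten, so clusters that deliberately set another vmType would keep their value.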
Description of problem:
Pods assigned a Multus whereabouts IP get stuck in the ContainerCreating state after upgrading OCP from 4.12.15 to 4.12.22. It is unclear whether the upgrade itself or the node reboot causes the issue. The error message is:
(combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox mypod-0-0-1-0_testproject_8c8500e1-1643-4716-8fd7-e032292c62ab_0(2baa045a1b19291769ed56bab288b60802179ff3138ffe0d16a14e78f9cb5e4f): error adding pod testproject_mypod-0-0-1-0 to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [testproject/mypod-0-0-1-0/8c8500e1-1643-4716-8fd7-e032292c62ab:testproject-net-svc-kernel-bond]: error adding container to network "testproject-net-svc-kernel-bond": error at storage engine: k8s get error: client rate limiter Wait returned an error: rate: Wait(n=1) would exceed context deadline
Version-Release number of selected component (if applicable):
How reproducible:
Not sure if it is reproducible
Steps to Reproduce:
1. 2. 3.
Actual results:
Pods stuck in ContainerCreating state
Expected results:
Pods are created normally
Additional info:
The customer reported that deleting and recreating the statefulset did not help. The pods could be created normally after manually deleting the corresponding ippools.whereabouts.cni.cncf.io:
$ oc delete ippools.whereabouts.cni.cncf.io 172.21.24.0-22 -n openshift-multus
Please review the following PR: https://github.com/openshift/cluster-bootstrap/pull/100
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-cloud-controller-manager-operator/pull/302
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-29858. The following is the description of the original issue:
—
The convention is a format like node-role.kubernetes.io/role: "", not node-role.kubernetes.io: role, however ROSA uses the latter format to indicate the infra role. This changes the node watch code to ignore it, as well as other potential variations like node-role.kubernetes.io/.
The current code panics when run against a ROSA cluster:
E0209 18:10:55.533265 78 runtime.go:79] Observed a panic: runtime.boundsError{x:24, y:23, signed:true, code:0x3} (runtime error: slice bounds out of range [24:23])
goroutine 233 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x7a71840?, 0xc0018e2f48})
    k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:75 +0x99
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0x1000251f9fe?})
    k8s.io/apimachinery@v0.27.2/pkg/util/runtime/runtime.go:49 +0x75
panic({0x7a71840, 0xc0018e2f48})
    runtime/panic.go:884 +0x213
github.com/openshift/origin/pkg/monitortests/node/watchnodes.nodeRoles(0x7ecd7b3?)
    github.com/openshift/origin/pkg/monitortests/node/watchnodes/node.go:187 +0x1e5
github.com/openshift/origin/pkg/monitortests/node/watchnodes.startNodeMonitoring.func1(0
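The `[24:23]` in the panic is consistent with slicing off the 24-character prefix `node-role.kubernetes.io/` from the 23-character key `node-role.kubernetes.io`. The following is a minimal Python sketch of the tolerant behaviour described above, not the actual Go change in openshift/origin; the function name `node_roles` and the sample labels are assumptions made for illustration.

```python
ROLE_PREFIX = "node-role.kubernetes.io/"  # 24 characters, including the slash


def node_roles(labels):
    """Return the sorted role names found in a node's labels.

    Keys that do not follow the node-role.kubernetes.io/<role> convention,
    such as the bare key or a key with an empty role, are ignored.
    """
    roles = []
    for key in labels:
        if not key.startswith(ROLE_PREFIX):
            continue  # also skips "node-role.kubernetes.io" without the slash
        role = key[len(ROLE_PREFIX):]
        if role:  # skips "node-role.kubernetes.io/" with nothing after it
            roles.append(role)
    return sorted(roles)


rosa_like_labels = {
    "node-role.kubernetes.io/worker": "",
    "node-role.kubernetes.io": "infra",  # non-conventional format
    "node-role.kubernetes.io/": "",      # another malformed variation
}
print(node_roles(rosa_like_labels))
```

Checking the prefix before slicing is what avoids the out-of-range access; in this sketch the non-conventional infra label simply does not contribute a role.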
Please review the following PR: https://github.com/openshift/agent-installer-utils/pull/31
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem: Panic on machine-controller
2023-11-23T18:18:47.899851056Z I1123 18:18:47.899752 1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="machine-controller" "name"="bogus-6121tjfqk-cpr4v" "namespace"="openshift-machine-api" "object"={"name":"bogus-6121tjfqk-cpr4v","namespace":"openshift-machine-api"} "reconcileID"="38050b3e-3313-4500-8955-59f6822fd650"
2023-11-23T18:18:47.901976792Z panic: runtime error: invalid memory address or nil pointer dereference [recovered]
2023-11-23T18:18:47.901976792Z panic: runtime error: invalid memory address or nil pointer dereference
2023-11-23T18:18:47.901976792Z [signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x27fcb31]
2023-11-23T18:18:47.902001202Z
2023-11-23T18:18:47.902001202Z goroutine 261 [running]:
2023-11-23T18:18:47.902001202Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
2023-11-23T18:18:47.902001202Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
2023-11-23T18:18:47.902013625Z panic({0x2ab4640, 0x4373ed0})
2023-11-23T18:18:47.902022923Z     /usr/lib/golang/src/runtime/panic.go:884 +0x213
2023-11-23T18:18:47.902043867Z github.com/openshift/machine-api-provider-openstack/pkg/machine.extractRootVolumeFromProviderSpec(...)
2023-11-23T18:18:47.902043867Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/convert.go:211
2023-11-23T18:18:47.902053364Z github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).Delete(0xc0000bfab0, {0x3113ff0?, 0xc000605ec0?}, 0xc00065fd40)
2023-11-23T18:18:47.902062370Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:335 +0x1b1
2023-11-23T18:18:47.902082577Z github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc000304aa0, {0x3113ff0, 0xc000605ec0}, {{{0xc000d66a50?, 0x0?}, {0xc000d66a38?, 0xc00043cd48?}}})
2023-11-23T18:18:47.902117667Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:216 +0x1dee
2023-11-23T18:18:47.902139450Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x31181b8?, {0x3113ff0?, 0xc000605ec0?}, {{{0xc000d66a50?, 0xb?}, {0xc000d66a38?, 0x0?}}})
2023-11-23T18:18:47.902166210Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xc8
2023-11-23T18:18:47.902186773Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005488c0, {0x3113f48, 0xc000350550}, {0x2b9b6a0?, 0xc000475760?})
2023-11-23T18:18:47.902196557Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3ca
2023-11-23T18:18:47.902205655Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005488c0, {0x3113f48, 0xc000350550})
2023-11-23T18:18:47.902214747Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
2023-11-23T18:18:47.902223782Z sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
2023-11-23T18:18:47.902223782Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
2023-11-23T18:18:47.902233237Z created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
2023-11-23T18:18:47.902242150Z     /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x587
The bogus machine bogus-6121tjfqk-cpr4v was created by openstack-test "[sig-installer][Suite:openshift/openstack] Bugfix bz_2073398: [Serial] MachineSet scale-in does not leak OpenStack ports" which was run before and passed.
Version-Release number of selected component (if applicable):
How reproducible: Observed once.
Additional info: must-gather provided on private comment
Description of problem:
An OCP 4.15 nightly deployment on bare-metal servers that does not use the provisioning network gets stuck during deployment.
Job history: https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g
The deployment gets stuck similar to this:
Upstream job logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.15-e2e-telco5g/1732520780954079232/artifacts/e2e-telco5g/telco5g-cluster-setup/artifacts/cloud-init-output.log
level=debug msg=ironic_node_v1.openshift-master-host[2]: Creating...
level=debug msg=ironic_node_v1.openshift-master-host[0]: Creating...
level=debug msg=ironic_node_v1.openshift-master-host[1]: Creating...
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [10s elapsed]
..
level=debug msg=ironic_node_v1.openshift-master-host[0]: Still creating... [2h28m51s elapsed]
level=debug msg=ironic_node_v1.openshift-master-host[1]: Still creating... [2h28m51s elapsed]
Ironic logs from bootstrap node:
Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''.
Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Dec 07 13:10:13 localhost.localdomain systemd[1]: provisioning-interface.service: Failed with result 'exit-code'.
Dec 07 13:10:13 localhost.localdomain systemd[1]: Failed to start Provisioning interface.
Dec 07 13:10:13 localhost.localdomain systemd[1]: Dependency failed for DHCP Service for Provisioning Network.
Dec 07 13:10:13 localhost.localdomain systemd[1]: ironic-dnsmasq.service: Job ironic-dnsmasq.service/start failed with result 'dependency'
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Every time
Steps to Reproduce:
1. Deploy OCP.
More information about our setup: In our environment, we have 3 virtual master nodes, 1 virtual worker, and 1 bare-metal worker. We use the KCLI tool to create the virtual environment and to run the deployment workflow using IPI. In our setup we don't use the provisioning network. (The same setup is used for other OCP versions up to 4.14, and those work fine.) We have attached our install-config.yaml (for RH employees) and logs from the bootstrap node.
Actual results:
Deployment is failing:
Dec 07 13:10:13 localhost.localdomain start-provisioning-nic.sh[3942]: Error: failed to modify ipv4.addresses: invalid IP address: Invalid IPv4 address ''.
Expected results:
Deployment should pass
Additional info:
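The bootstrap log shows an empty string being applied as the provisioning IPv4 address. One possible guard, sketched here in Python with the standard `ipaddress` module, is to validate the value before using it and to skip the provisioning interface setup when no address is configured. The function name `valid_ipv4_interface` is hypothetical, and this is not the actual fix applied to the bootstrap scripts.

```python
import ipaddress


def valid_ipv4_interface(value: str) -> bool:
    """Return True if `value` is an IPv4 address, optionally with a prefix length."""
    try:
        ipaddress.IPv4Interface(value)
    except ValueError:
        # AddressValueError and NetmaskValueError both derive from ValueError.
        return False
    return True


# An unset provisioning IP, as in the failing deployment, versus example values.
for candidate in ["", "172.22.0.2/24", "not-an-ip"]:
    action = "configure" if valid_ipv4_interface(candidate) else "skip"
    print(repr(candidate), action)
```

With such a check, a deployment without a provisioning network would skip the step instead of failing the service and everything that depends on it.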
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/multus-admission-controller/pull/69
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-driver-manila-operator/pull/204
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26488. The following is the description of the original issue:
—
Description of problem:
CCO reports credsremoved mode in metrics when the cluster is actually in the default mode. See https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/47349/rehearse-47349-pull-ci-openshift-cloud-credential-operator-release-4.16-e2e-aws-qe/1744240905512030208 (OCP-31768).
Version-Release number of selected component (if applicable):
4.16
How reproducible:
Always.
Steps to Reproduce:
1. Create an AWS cluster with CCO in the default mode (ends up in mint).
2. Get the value of the cco_credentials_mode metric.
Actual results:
credsremoved
Expected results:
mint
Root cause:
The controller-runtime client used in metrics calculator (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L77) is unable to GET the root credentials Secret (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/operator/metrics/metrics.go#L184) since it is backed by a cache which only contains target Secrets requested by other operators (https://github.com/openshift/cloud-credential-operator/blob/77a68ad01e75162bfa04097b22f80d305c192439/pkg/cmd/operator/cmd.go#L164-L168).
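The root cause can be illustrated with a toy model. The sketch below is not the CCO or controller-runtime code: the classes, the simplified mode detection, and the Secret names are assumptions chosen only to show how a reader backed by a filtered cache makes an existing root Secret look removed.

```python
class NotFound(Exception):
    """Raised when a reader cannot return the requested Secret."""


class LiveReader:
    """Toy stand-in for a client that reads straight from the API server."""

    def __init__(self, secrets):
        self._secrets = dict(secrets)

    def get(self, namespace, name):
        try:
            return self._secrets[(namespace, name)]
        except KeyError:
            raise NotFound(f"{namespace}/{name}") from None


class CachedReader(LiveReader):
    """Toy stand-in for a cache-backed client that only holds selected Secrets."""

    def __init__(self, live, cached_keys):
        super().__init__({key: live.get(*key) for key in cached_keys})


def credentials_mode(reader, root_secret):
    """Simplified mode detection: a missing root Secret reads as 'credsremoved'."""
    try:
        reader.get(*root_secret)
    except NotFound:
        return "credsremoved"
    return "mint"


ROOT = ("kube-system", "aws-creds")  # assumed root Secret location
TARGET = ("openshift-image-registry", "installer-cloud-credentials")  # hypothetical target Secret

live = LiveReader({ROOT: {"aws_access_key_id": "..."}, TARGET: {}})
cached = CachedReader(live, cached_keys=[TARGET])

print(credentials_mode(cached, ROOT))  # the root Secret is absent from the cache
print(credentials_mode(live, ROOT))    # a live read finds the root Secret
```

In this model the same cluster state yields two different answers depending only on which reader is used, which mirrors the mismatch between the reported and the expected metric value.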
Work has been done in Gophercloud; we now need to bump Gophercloud in Installer.
Description of problem:
When mirroring a multiarch release payload through oc adm release mirror --keep-manifest-list --to-image-stream into an image stream of a cluster's internal registry, the cluster does not import the image as a manifest list.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. oc adm release mirror \
     --from=quay.io/openshift-release-dev/ocp-release:4.14.0-rc.5-multi \
     --to-image-stream=release \
     --keep-manifest-list=true
2. oc get istag release:installer -o yaml
Actual results:
apiVersion: image.openshift.io/v1
generation: 1
image:
  dockerImageLayers:
  - mediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
    name: sha256:97da74cc6d8fa5d1634eb1760fd1da5c6048619c264c23e62d75f3bf6b8ef5c4
    size: 79524639
  - mediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
    name: sha256:d8190195889efb5333eeec18af9b6c82313edd4db62989bd3a357caca4f13f0e
    size: 1438
  - mediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
    name: sha256:09c3f3b6718f2df2ee9cd3a6c2e19ddb73ca777f216d310eaf4e0420407ea7c7
    size: 59044444
  - mediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
    name: sha256:cf84754d71b4b704c30abd45668882903e3eaa1355857b605e1dbb25ecf516d7
    size: 11455659
  - mediaType: application/vnd.docker.image.rootfs.diff.tar.gzip
    name: sha256:2e20a50f4b685b3976028637f296ae8839c18a9505b5f58d6e4a0f03984ef1e8
    size: 433281528
  dockerImageManifestMediaType: application/vnd.docker.distribution.manifest.v2+json
  dockerImageMetadata:
    Architecture: amd64
    Config:
      Entrypoint:
      - /bin/openshift-install
      Env:
      - container=oci
      - GODEBUG=x509ignoreCN=0,madvdontneed=1
      - __doozer=merge
      - BUILD_RELEASE=202310100645.p0.gc926532.assembly.stream
      - BUILD_VERSION=v4.15.0
      - OS_GIT_MAJOR=4
      - OS_GIT_MINOR=15
      - OS_GIT_PATCH=0
      - OS_GIT_TREE_STATE=clean
      - OS_GIT_VERSION=4.15.0-202310100645.p0.gc926532.assembly.stream-c926532
      - SOURCE_GIT_TREE_STATE=clean
      - __doozer_group=openshift-4.15
      - __doozer_key=ose-installer
      - OS_GIT_COMMIT=c926532
      - SOURCE_DATE_EPOCH=1696907019
      - SOURCE_GIT_COMMIT=c926532cd50b6ef4974f14dfe3d877a0f7707972
      - SOURCE_GIT_TAG=agent-installer-v4.11.0-dev-preview-2-2165-gc926532cd5
      - SOURCE_GIT_URL=https://github.com/openshift/installer
      - PATH=/bin
      - HOME=/output
      Labels:
        License: GPLv2+
        architecture: x86_64
        build-date: 2023-10-10T10:01:18
        com.redhat.build-host: cpt-1001.osbs.prod.upshift.rdu2.redhat.com
        com.redhat.component: ose-installer-container
        com.redhat.license_terms: https://www.redhat.com/agreements
        description: This is the base image from which all OpenShift Container Platform images inherit.
        distribution-scope: public
        io.buildah.version: 1.29.0
        io.k8s.description: This is the base image from which all OpenShift Container Platform images inherit.
        io.k8s.display-name: OpenShift Container Platform RHEL 8 Base
        io.openshift.build.commit.id: c926532cd50b6ef4974f14dfe3d877a0f7707972
        io.openshift.build.commit.url: https://github.com/openshift/installer/commit/c926532cd50b6ef4974f14dfe3d877a0f7707972
        io.openshift.build.source-location: https://github.com/openshift/installer
        io.openshift.expose-services: ""
        io.openshift.maintainer.component: Installer / openshift-installer
        io.openshift.maintainer.project: OCPBUGS
        io.openshift.release.operator: "true"
        io.openshift.tags: openshift,base
        maintainer: Red Hat, Inc.
        name: openshift/ose-installer
        release: 202310100645.p0.gc926532.assembly.stream
        summary: Provides the latest release of the Red Hat Extended Life Base Image.
        url: https://access.redhat.com/containers/#/registry.access.redhat.com/openshift/ose-installer/images/v4.15.0-202310100645.p0.gc926532.assembly.stream
        vcs-ref: d40a2800e169f6c2d63897467af22d59933e8811
        vcs-type: git
        vendor: Red Hat, Inc.
        version: v4.15.0
      User: 1000:1000
      WorkingDir: /output
    ContainerConfig: {}
    Created: "2023-10-10T10:59:36Z"
    Id: sha256:ae4c47d3c08de5d57b5d4fa8a30497ac097c05abab4e284c91eae389e512f202
    Size: 583326767
    apiVersion: image.openshift.io/1.0
    kind: DockerImage
  dockerImageMetadataVersion: "1.0"
  dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67d35b2185c9f267523f86e54f403d0d2561c9098b7bb81fa3bfd6fd8a121d04
  metadata:
    annotations:
      image.openshift.io/dockerLayersOrder: ascending
    creationTimestamp: "2023-10-11T10:56:53Z"
    name: sha256:67d35b2185c9f267523f86e54f403d0d2561c9098b7bb81fa3bfd6fd8a121d04
    resourceVersion: "740341"
    uid: 17dede63-ca3b-47ad-a157-c78f38c1df7d
kind: ImageStreamTag
lookupPolicy:
  local: true
metadata:
  creationTimestamp: "2023-10-12T09:32:10Z"
  name: release:installer
  namespace: okd-fcos
  resourceVersion: "1329147"
  uid: d6cfcd4d-3f9c-4bb1-bc56-04bf5e926628
tag:
  annotations: null
  from:
    kind: DockerImage
    name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c510f0e2bd29f7b9bf45146fbc212e893634179cc029cd54a135f05f9ae1df52
  generation: 12
  importPolicy:
    importMode: Legacy
  name: installer
  referencePolicy:
    type: Source
Expected results:
apiVersion: image.openshift.io/v1 generation: 12 image: dockerImageManifestMediaType: application/vnd.docker.distribution.manifest.list.v2+json dockerImageManifests: - architecture: amd64 digest: sha256:67d35b2185c9f267523f86e54f403d0d2561c9098b7bb81fa3bfd6fd8a121d04 manifestSize: 1087 mediaType: application/vnd.docker.distribution.manifest.v2+json os: linux - architecture: arm64 digest: sha256:a602c3e4b5f8f747b2813ed2166f366417f638fc6884deecebdb04e18431fcd6 manifestSize: 1087 mediaType: application/vnd.docker.distribution.manifest.v2+json os: linux - architecture: ppc64le digest: sha256:04296057a8f037f20d4b1ca20bcaac5bdca5368cdd711a3f37bd05d66c9fdaec manifestSize: 1087 mediaType: application/vnd.docker.distribution.manifest.v2+json os: linux - architecture: s390x digest: sha256:5fda4ea09bfd2026b7d6acd80441b2b7c51b1cf440fd46e0535a7320b67894fb manifestSize: 1087 mediaType: application/vnd.docker.distribution.manifest.v2+json os: linux dockerImageMetadata: ContainerConfig: {} Created: "2023-10-12T09:32:03Z" Id: sha256:c510f0e2bd29f7b9bf45146fbc212e893634179cc029cd54a135f05f9ae1df52 apiVersion: image.openshift.io/1.0 kind: DockerImage dockerImageMetadataVersion: "1.0" dockerImageReference: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c510f0e2bd29f7b9bf45146fbc212e893634179cc029cd54a135f05f9ae1df52 metadata: creationTimestamp: "2023-10-12T09:32:10Z" name: sha256:c510f0e2bd29f7b9bf45146fbc212e893634179cc029cd54a135f05f9ae1df52 resourceVersion: "1327949" uid: 4d78c9ba-12b2-414f-a173-b926ae019ab0 kind: ImageStreamTag lookupPolicy: local: true metadata: creationTimestamp: "2023-10-12T09:32:10Z" name: release:installer namespace: okd-fcos resourceVersion: "1329147" uid: d6cfcd4d-3f9c-4bb1-bc56-04bf5e926628 tag: annotations: null from: kind: DockerImage name: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:c510f0e2bd29f7b9bf45146fbc212e893634179cc029cd54a135f05f9ae1df52 generation: 12 importPolicy: importMode: PreserveOriginal name: installer referencePolicy: 
type: Source
Additional info:
Description of problem:
An infra object in some vsphere deployments can look like this:
~]$ oc get infrastructure cluster -o json | jq .status
{
  "apiServerInternalURI": "xxx",
  "apiServerURL": "xxx",
  "controlPlaneTopology": "HighlyAvailable",
  "etcdDiscoveryDomain": "",
  "infrastructureName": "xxx",
  "infrastructureTopology": "HighlyAvailable",
  "platform": "VSphere",
  "platformStatus": {
    "type": "VSphere"
  }
}
Running the regenerate MCO command from https://access.redhat.com/articles/regenerating_cluster_certificates against a cluster with such an infra object causes a panic.
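A panic of this kind is typical of code that reads a platform-specific status block without first checking that it is present. The following is a minimal sketch, not the MCO code: it assumes the trigger is the absent `vsphere` block under `platformStatus` (the report does not state the cause), and `vsphere_status` and the `apiServerInternalIPs` lookup are illustrative choices only.

```python
import json

# Status block as reported above: platformStatus carries only "type".
INFRA_STATUS = json.loads("""
{
  "platform": "VSphere",
  "platformStatus": {"type": "VSphere"}
}
""")


def vsphere_status(status):
    """Return the vSphere-specific status dict, or None when it is absent.

    Hypothetical helper: every level is read with .get(), so a sparse
    infrastructure object yields None instead of raising.
    """
    platform_status = status.get("platformStatus") or {}
    if platform_status.get("type") != "VSphere":
        return None
    return platform_status.get("vsphere")


details = vsphere_status(INFRA_STATUS)
if details is None:
    print("no vSphere platform details; skipping vSphere-specific handling")
else:
    # Field name assumed for illustration.
    print("API server internal IPs:", details.get("apiServerInternalIPs", []))
```

The point of the sketch is only that the absent block is handled as an ordinary case rather than as an unexpected one.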
Version-Release number of selected component (if applicable):
4.10.65 4.11.47 4.12.29 4.13.8 4.14.0 4.15
How reproducible:
100%
Steps to Reproduce:
1. Run the procedure on a cluster with the above infra object
Actual results:
panic
Expected results:
no panic
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-aws/pull/48
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-capi-operator/pull/129
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
We need to backport https://github.com/cri-o/cri-o/pull/7744 into the 1.28 branch of CRI-O. CI is failing on upgrades because of a feature that is not in 1.28.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-28718. The following is the description of the original issue:
—
Description of problem:
On the service details page, under the Revision and Route tabs, the user sees a "No resource found" message even though a Revision and a Route have been created for that service.
Version-Release number of selected component (if applicable):
4.15.z
How reproducible:
Always
Steps to Reproduce:
1. Install serverless operator
2. Create serving instance
3. Create knative service/function
4. Go to details page
Actual results:
User is not able to see Revision and Route created for the service
Expected results:
User should be able to see Revision and Route created for the service
Additional info:
Description of the problem:
The assisted installer doesn't freeze and unmount the file systems used when overwriting the OS image.
This causes the file system to become corrupt.
How reproducible:
Always for ZTP flow.
Steps to reproduce:
1. Run ZTP with enable-skip-mco-reboot set to true
Actual results:
Installation fails. Host drops to emergency shell.
Expected results:
Successful installation.
Please review the following PR: https://github.com/openshift/cluster-authentication-operator/pull/634
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/cluster-image-registry-operator/pull/918
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/oc/pull/1546
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/csi-external-attacher/pull/65
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the description of the ClusterOperatorDown alert, there is a stray $ before {{ $labels.reason }}:
$ oc -n openshift-cluster-version get prometheusrules cluster-version-operator -oyaml
....
- alert: ClusterOperatorDown
  annotations:
    description: The {{ $labels.name }} operator may be down or disabled because ${{ $labels.reason }}, and the components it manages may be unavailable or degraded. Cluster upgrades may not complete. For more information refer to 'oc get -o yaml clusteroperator {{ $labels.name }}'{{ with $console_url := "console_url" | query }}{{ if ne (len (label "url" (first $console_url ) ) ) 0}} or {{ label "url" (first $console_url ) }}/settings/cluster/{{ end }}{{ end }}.
    summary: Cluster operator has not been available for 10 minutes.
  expr: |
    max by (namespace, name, reason) (cluster_operator_up{job="cluster-version-operator"} == 0)
  for: 10m
  labels:
    severity: critical
When the ClusterOperatorDown alert fires, the description looks like this:
The insights operator may be down or disabled because $UploadFailed,and the components it manages....
If this is intended, this bug can be closed.
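To illustrate why the rendered text contains "$UploadFailed", here is a minimal sketch using a toy renderer. It is not Prometheus' templating engine: `render` is a hypothetical stand-in that only substitutes `{{ $labels.<key> }}` actions, and the template is shortened to the relevant clause.

```python
import re

# Description template as quoted above, shortened to the relevant part.
TEMPLATE = (
    "The {{ $labels.name }} operator may be down or disabled "
    "because ${{ $labels.reason }}, and the components it manages "
    "may be unavailable or degraded."
)


def render(template, labels):
    """Replace each {{ $labels.<key> }} action with its label value.

    Toy stand-in for the real templating: text outside the {{ ... }}
    actions, including a literal '$', is copied through unchanged.
    """
    def substitute(match):
        return labels[match.group(1)]

    return re.sub(r"\{\{\s*\$labels\.(\w+)\s*\}\}", substitute, template)


labels = {"name": "insights", "reason": "UploadFailed"}

# With the '$' in front of the action, it survives into the output.
print(render(TEMPLATE, labels))

# Without it, the reason is rendered on its own.
print(render(TEMPLATE.replace("${{", "{{"), labels))
```

The literal `$` sits outside the `{{ ... }}` action, so it is treated as plain text and copied into the alert description.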
Version-Release number of selected component (if applicable):
4.13.0-0.nightly-2023-02-27-101545
How reproducible:
always
(This is a clone of OCPBUGS-24009 targeting 4.15)
TRT has picked up a somewhat rare but new failure coming out of the packageserver operator; it surfaces in this test. It appears to affect only Azure 4.14 -> 4.15 (aka minor) upgrades, roughly 5% of the time.
Examining job runs where this test failed in sippy we can see the error output is typically:
operator conditions operator-lifecycle-manager-packageserver 0s {Operator unavailable (ClusterServiceVersionNotSucceeded): ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallComponentFailed, message: install strategy failed: clusterrolebindings.rbac.authorization.k8s.io "packageserver-service-system:auth-delegator" already exists Operator unavailable (ClusterServiceVersionNotSucceeded): ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallComponentFailed, message: install strategy failed: clusterrolebindings.rbac.authorization.k8s.io "packageserver-service-system:auth-delegator" already exists}
or
{Operator unavailable (ClusterServiceVersionNotSucceeded): ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallComponentFailed, message: install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists Operator unavailable (ClusterServiceVersionNotSucceeded): ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: InstallComponentFailed, message: install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists}
The failed job runs also indicate this problem appears to have started, or started occurring far more frequently, somewhere around Nov 14 - Nov 18. It has been very common since the 18th, happening multiple times a day.
Please review the following PR: https://github.com/openshift/cluster-kube-scheduler-operator/pull/513
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/installer/pull/7493
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Issue 44 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
The Observe tab has Metric and Events within an Accordion component whose blue border sits flush against the sidebar container. Either remove the border or add spacing between the two.
Screenshot: https://drive.google.com/file/d/1i8SMUwTYXZL4CG0r1UXnxnm5e8QdAhQK/view?usp=sharing
Description of problem:
The bubble box has the wrong layout.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-11-16-110328
How reproducible:
Always
Steps to Reproduce:
1. Make sure there are no pods in the project you are using
2. Navigate to Networking -> NetworkPolicies -> Create NetworkPolicy page and click 'affected pods' in the Pod selector section
3. Check the layout in the bubble component
Actual results:
The layout is incorrect (shared file: https://drive.google.com/file/d/1I8e2ZkiFO2Gu4nSt9kJ6JmRG3LdvkE-u/view?usp=drive_link)
Expected results:
The layout should be correct.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-api/pull/188
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
After updating a CPMS CR with a non-existent network, a machine is stuck in the provisioning state. Then, when the CPMS is updated back to the previous configuration, the master Machine is stuck in the deleting state.
Logs from the machine API controller:
I0720 13:03:58.894171 1 controller.go:187] ostest-2pwfk-master-xwprn-0: reconciling Machine
I0720 13:03:58.902876 1 controller.go:231] ostest-2pwfk-master-xwprn-0: reconciling machine triggers delete
E0720 13:04:00.200290 1 controller.go:255] ostest-2pwfk-master-xwprn-0: failed to delete machine: filter matched no resources
E0720 13:04:00.200499 1 controller.go:329] "msg"="Reconciler error" "error"="filter matched no resources" "controller"="machine-controller" "name"="ostest-2pwfk-master-xwprn-0" "namespace"="openshift-machine-api" "object"={"name":"ostest-2pwfk-master-xwprn-0","namespace":"openshift-machine-api"} "reconcileID"="9ccb5885-4b9f-4190-95a2-1120f2566c52"
Version-Release number of selected component (if applicable):
OCP 4.14.0-0.nightly-2023-07-18-085740 RHOS-17.1-RHEL-9-20230712.n.1
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/monitoring-plugin/pull/75
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-24186. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Installation with Kuryr is failing because multiple components are attempting to connect to the API and fail with the following error:
failed checking apiserver connectivity: Get "https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-service-ca/leases/service-ca-controller-lock": tls: failed to verify certificate: x509: cannot validate certificate for 172.30.0.1 because it doesn't contain any IP SANs
$ oc get po -A -o wide |grep -v Running |grep -v Pending |grep -v Completed
NAMESPACE   NAME   READY   STATUS   RESTARTS   AGE   IP   NODE   NOMINATED NODE   READINESS GATES
openshift-apiserver-operator   openshift-apiserver-operator-559d855c56-c2rdr   0/1   CrashLoopBackOff   42 (2m28s ago)   3h44m   10.128.16.86   kuryr-5sxhw-master-2   <none>   <none>
openshift-apiserver   apiserver-6b9f5d48c4-bj6s6   0/2   CrashLoopBackOff   92 (4m25s ago)   3h36m   10.128.70.10   kuryr-5sxhw-master-2   <none>   <none>
openshift-cluster-csi-drivers   manila-csi-driver-operator-75b64d8797-fckf5   0/1   CrashLoopBackOff   42 (119s ago)   3h41m   10.128.56.21   kuryr-5sxhw-master-0   <none>   <none>
openshift-cluster-csi-drivers   openstack-cinder-csi-driver-operator-84dfd8d89f-kgtr8   0/1   CrashLoopBackOff   42 (82s ago)   3h41m   10.128.56.9   kuryr-5sxhw-master-0   <none>   <none>
openshift-cluster-node-tuning-operator   cluster-node-tuning-operator-7fbb66545c-kh6th   0/1   CrashLoopBackOff   46 (3m5s ago)   3h44m   10.128.6.40   kuryr-5sxhw-master-2   <none>   <none>
openshift-cluster-storage-operator   cluster-storage-operator-5545dfcf6d-n497j   0/1   CrashLoopBackOff   42 (2m23s ago)   3h44m   10.128.21.175   kuryr-5sxhw-master-2   <none>   <none>
openshift-cluster-storage-operator   csi-snapshot-controller-ddb9469f9-bc4bb   0/1   CrashLoopBackOff   45 (2m17s ago)   3h41m   10.128.20.106   kuryr-5sxhw-master-1   <none>   <none>
openshift-cluster-storage-operator   csi-snapshot-controller-operator-6d7b66dbdd-xdwcs   0/1   CrashLoopBackOff   42 (92s ago)   3h44m   10.128.21.220   kuryr-5sxhw-master-2   <none>   <none>
openshift-config-operator   openshift-config-operator-c5d5d964-2w2bv   0/1   CrashLoopBackOff   80 (3m39s ago)   3h44m   10.128.43.39   kuryr-5sxhw-master-2   <none>   <none>
openshift-controller-manager-operator   openshift-controller-manager-operator-754d748cf7-rzq6f   0/1   CrashLoopBackOff   42 (3m6s ago)   3h44m   10.128.25.166   kuryr-5sxhw-master-2   <none>   <none>
openshift-etcd-operator   etcd-operator-76ddc94887-zqkn7   0/1   CrashLoopBackOff   49 (30s ago)   3h44m   10.128.32.146   kuryr-5sxhw-master-2   <none>   <none>
openshift-ingress-operator   ingress-operator-9f76cf75b-cjx9t   1/2   CrashLoopBackOff   39 (3m24s ago)   3h44m   10.128.9.108   kuryr-5sxhw-master-2   <none>   <none>
openshift-insights   insights-operator-776cd7cfb4-8gzz7   0/1   CrashLoopBackOff   46 (4m21s ago)   3h44m   10.128.15.102   kuryr-5sxhw-master-2   <none>   <none>
openshift-kube-apiserver-operator   kube-apiserver-operator-64f4db777f-7n9jv   0/1   CrashLoopBackOff   42 (113s ago)   3h44m   10.128.18.199   kuryr-5sxhw-master-2   <none>   <none>
openshift-kube-apiserver   installer-5-kuryr-5sxhw-master-1   0/1   Error   0   3h35m   10.128.68.176   kuryr-5sxhw-master-1   <none>   <none>
openshift-kube-controller-manager-operator   kube-controller-manager-operator-746497b-dfbh5   0/1   CrashLoopBackOff   42 (2m23s ago)   3h44m   10.128.13.162   kuryr-5sxhw-master-2   <none>   <none>
openshift-kube-controller-manager   installer-4-kuryr-5sxhw-master-0   0/1   Error   0   3h35m   10.128.65.186   kuryr-5sxhw-master-0   <none>   <none>
openshift-kube-scheduler-operator   openshift-kube-scheduler-operator-695fb4449f-j9wqx   0/1   CrashLoopBackOff   42 (63s ago)   3h44m   10.128.44.194   kuryr-5sxhw-master-2   <none>   <none>
openshift-kube-scheduler   installer-5-kuryr-5sxhw-master-0   0/1   Error   0   3h35m   10.128.60.44   kuryr-5sxhw-master-0   <none>   <none>
openshift-kube-storage-version-migrator-operator   kube-storage-version-migrator-operator-6c5cd46578-qpk5z   0/1   CrashLoopBackOff   42 (2m18s ago)   3h44m   10.128.4.120   kuryr-5sxhw-master-2   <none>   <none>
openshift-machine-api   cluster-autoscaler-operator-7b667675db-tmlcb   1/2   CrashLoopBackOff   46 (2m53s ago)   3h45m   10.128.28.146   kuryr-5sxhw-master-2   <none>   <none>
openshift-machine-api   machine-api-controllers-fdb99649c-ldb7t   3/7   CrashLoopBackOff   184 (2m55s ago)   3h40m   10.128.29.90   kuryr-5sxhw-master-0   <none>   <none>
openshift-route-controller-manager   route-controller-manager-d8f458684-7dgjm   0/1   CrashLoopBackOff   43 (100s ago)   3h36m   10.128.55.11   kuryr-5sxhw-master-2   <none>   <none>
openshift-service-ca-operator   service-ca-operator-654f68c77f-g4w55   0/1   CrashLoopBackOff   42 (2m2s ago)   3h45m   10.128.22.30   kuryr-5sxhw-master-2   <none>   <none>
openshift-service-ca   service-ca-5f584b7d75-mxllm   0/1   CrashLoopBackOff   42 (45s ago)   3h42m   10.128.49.250   kuryr-5sxhw-master-0   <none>   <none>
$ oc get svc -A |grep 172.30.0.1
default   kubernetes   ClusterIP   172.30.0.1   <none>   443/TCP   3h50m
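The x509 error above reflects a general verification rule: when a client connects to an IP address, only IP entries in the certificate's subject alternative names are considered, and DNS entries do not count. A minimal sketch of that rule follows. The SAN lists are hypothetical (the report does not show the certificate), and `san_covers_host` is an illustrative helper that takes the `(type, value)` tuple layout Python's `ssl` module reports under `subjectAltName`.

```python
import ipaddress


def san_covers_host(host, subject_alt_names):
    """Return True if the SAN list authorises `host`.

    When `host` is an IP literal, only "IP Address" entries are checked;
    DNS entries are never matched against an IP.
    """
    try:
        target = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: exact, case-insensitive DNS comparison.
        # Wildcard matching is deliberately left out of this sketch.
        return any(
            kind == "DNS" and value.lower() == host.lower()
            for kind, value in subject_alt_names
        )
    return any(
        kind == "IP Address" and ipaddress.ip_address(value) == target
        for kind, value in subject_alt_names
    )


# Hypothetical SAN lists, for illustration only.
dns_only = [("DNS", "kubernetes"), ("DNS", "kubernetes.default.svc")]
with_ip = dns_only + [("IP Address", "172.30.0.1")]

print(san_covers_host("172.30.0.1", dns_only))  # certificate without IP SANs
print(san_covers_host("172.30.0.1", with_ip))   # certificate with the service IP
```

Under this rule a certificate that lists only DNS names cannot be validated for https://172.30.0.1:443, which is consistent with the error the components report.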
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1963
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25843. The following is the description of the original issue:
—
Description of problem:
On the customer feedback modal, there are three links for users to send feedback to Red Hat; the third link lacks a title.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-21-155123
How reproducible:
Always
Steps to Reproduce:
1. Log in to the admin console. Click "?" -> "Share Feedback" and check the links on the modal.
Actual results:
1. The third link lacks a link title (the link for "Learn about opportunities to ……").
Expected results:
1. The link title "Inform the direction of Red Hat" is present in 4.14; it should also exist in 4.15.
Additional info:
screenshot for 4.14 page: https://drive.google.com/file/d/19AnPlE0h9WwvIjxV0gLuf5x27jLN7TLS/view?usp=drive_link
screenshot for 4.15 page: https://drive.google.com/file/d/19MRjzNGRWfYnK-zcoMozh7Z7eaDDG2L-/view?usp=drive_link
Description of problem:
All files under path /var/log/kube-apiserver/ should have 600 permission. The file /var/log/kube-apiserver/termination.log for kube-apiserver has 644 permission on some nodes.
$ for node in `oc get node -l node-role.kubernetes.io/control-plane= --no-headers|awk '{print $1}'`;do oc debug node/$node -- chroot /host ls -l /var/log/kube-apiserver/;done
Temporary namespace openshift-debug-gj262 is created for debugging node...
Starting pod/ip-x-us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 221752
-rw-------. 1 root root 209714718 Jul 12 05:47 audit-2023-07-12T05-47-16.625.log
-rw-------. 1 root root 13233368 Jul 12 05:54 audit.log
-rw-------. 1 root root 646569 Jul 12 04:19 termination.log
Removing debug pod ...
Temporary namespace openshift-debug-gj262 was removed.
Temporary namespace openshift-debug-cmdgm is created for debugging node...
Starting pod/ip-xus-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 49640
-rw-------. 1 root root 49826363 Jul 12 05:54 audit.log
-rw-------. 1 root root 826226 Jul 12 04:23 termination.log
Removing debug pod ...
Temporary namespace openshift-debug-cmdgm was removed.
Temporary namespace openshift-debug-fdqtv is created for debugging node...
Starting pod/ip-xus-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 270276
-rw-------. 1 root root 209714252 Jul 12 05:34 audit-2023-07-12T05-34-34.205.log
-rw-------. 1 root root 51250736 Jul 12 05:54 audit.log
-rw-r--r--. 1 root root 4 Jul 12 04:15 termination.log
Removing debug pod ...
Temporary namespace openshift-debug-fdqtv was removed.
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.14.0-0.nightly-2023-07-11-092038 True False 91m Cluster version is 4.14.0-0.nightly-2023-07-11-092038
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-11-092038
How reproducible:
Always
Steps to Reproduce:
1. $ for node in `oc get node -l node-role.kubernetes.io/control-plane= --no-headers|awk '{print $1}'`;do oc debug node/$node -- chroot /host ls -l /var/log/kube-apiserver/;done
Actual results:
File /var/log/kube-apiserver/termination.log for kube-apiserver on some nodes has 644 permission.
Expected results:
All files under path /var/log/kube-apiserver/ should have 600 permission.
Additional info:
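The expectation above (every file in the directory has mode 600) can be checked mechanically. The following is a minimal sketch under stated assumptions: `files_not_mode` is a hypothetical helper, and the demonstration runs on a throwaway directory that stands in for /var/log/kube-apiserver/ rather than on a real node.

```python
import os
import stat
import tempfile


def files_not_mode(directory, expected=0o600):
    """Return {filename: mode} for regular files whose permission bits
    differ from `expected`."""
    offenders = {}
    for entry in os.scandir(directory):
        if entry.is_file(follow_symlinks=False):
            mode = stat.S_IMODE(entry.stat(follow_symlinks=False).st_mode)
            if mode != expected:
                offenders[entry.name] = mode
    return offenders


# Throwaway directory standing in for /var/log/kube-apiserver/.
with tempfile.TemporaryDirectory() as log_dir:
    for name, mode in (("audit.log", 0o600), ("termination.log", 0o644)):
        path = os.path.join(log_dir, name)
        with open(path, "w") as handle:
            handle.write("x")
        os.chmod(path, mode)  # set explicitly so the umask does not matter
    offenders = files_not_mode(log_dir)

for name, mode in sorted(offenders.items()):
    print(f"{name}: {oct(mode)}")
```

On a node, the same function could be pointed at the real directory from a debug shell; an empty result would correspond to the expected state.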
This issue has been updated to capture a larger ongoing issue around console 304 status responses for plugins. This has been observed for ODF, ACM, MCE, monitoring, and other plugins going back to 4.12. Related links:
Original report from this bug:
Description of problem:
Error logs appear in the console pod logs.
Version-Release number of selected component (if applicable):
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.15.0-0.nightly-2023-09-27-073353 True False 37m Cluster version is 4.15.0-0.nightly-2023-09-27-073353
How reproducible:
100% on ipv6 clusters
Steps to Reproduce:
1. % oc -n openshift-console logs console-6fbf69cc49-7jq5b
...
E0928 00:35:24.098808 1 handlers.go:172] GET request for "monitoring-plugin" plugin failed with 304 status code
E0928 00:35:24.098822 1 utils.go:43] Failed sending HTTP response body: http: request method or response status code does not allow body
E0928 00:35:39.611569 1 handlers.go:172] GET request for "monitoring-plugin" plugin failed with 304 status code
E0928 00:35:39.611583 1 utils.go:43] Failed sending HTTP response body: http: request method or response status code does not allow body
E0928 00:35:54.442150 1 handlers.go:172] GET request for "monitoring-plugin" plugin failed with 304 status code
E0928 00:35:54.442167 1 utils.go:43] Failed sending HTTP response body: http: request method or response status code does not allow body
Actual results:
GET request for "monitoring-plugin" plugin failed with 304 status code
Expected results:
no monitoring-plugin related error logs
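The second log message matches the standard HTTP rule that some responses, 304 Not Modified among them, never carry a body. The sketch below shows that rule and one way a proxy step could pass a 304 through without logging an error. It is not the console's code: `relay` is a hypothetical function, and treating 304 as a normal cache revalidation result is an assumption about the desired behaviour.

```python
from http import HTTPStatus


def body_allowed(method, status):
    """Return True if a response with this method and status may carry a body.

    Responses to HEAD requests and responses with 1xx, 204 or 304
    status codes have no body.
    """
    if method.upper() == "HEAD":
        return False
    if 100 <= status <= 199:
        return False
    return status not in (HTTPStatus.NO_CONTENT, HTTPStatus.NOT_MODIFIED)


def relay(method, upstream_status, upstream_body):
    """Hypothetical proxy step: decide what to send back to the browser.

    A bodiless status is forwarded as-is with an empty body instead of
    being reported as a failure.
    """
    if body_allowed(method, upstream_status):
        return upstream_status, upstream_body
    return upstream_status, b""


print(relay("GET", 304, b"ignored"))
print(relay("GET", 200, b"plugin-manifest"))
```

With a check like this in place, writing a body is simply skipped for 304 responses, so neither of the two error lines would be produced for a successful revalidation.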
This is a clone of issue OCPBUGS-27282. The following is the description of the original issue:
—
Description of problem:
We need to make the controllerAvailabilityPolicy field immutable in the HostedCluster spec section to ensure the customer cannot switch between SingleReplica and HighAvailability.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
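As an illustration of what an immutability rule means in practice, here is a minimal sketch of an update check. It is not the HyperShift implementation: `validate_update` and `ImmutableFieldError` are hypothetical names, specs are plain dicts, and the policy values are the ones used in the description above.

```python
class ImmutableFieldError(ValueError):
    """Raised when an update tries to change an immutable field."""


def validate_update(old_spec, new_spec,
                    immutable_fields=("controllerAvailabilityPolicy",)):
    """Reject an update that changes any immutable field.

    Hypothetical validator: a real cluster would enforce this in the
    API's validation layer rather than in client code.
    """
    for field in immutable_fields:
        if old_spec.get(field) != new_spec.get(field):
            raise ImmutableFieldError(
                f"{field} is immutable: "
                f"{old_spec.get(field)!r} -> {new_spec.get(field)!r}"
            )


old = {"controllerAvailabilityPolicy": "SingleReplica", "release": "a"}

# Changing other fields is still allowed.
validate_update(old, {**old, "release": "b"})

# Changing the policy is rejected.
try:
    validate_update(old, {**old, "controllerAvailabilityPolicy": "HighAvailability"})
except ImmutableFieldError as err:
    print("rejected:", err)
```

The check compares the stored value with the submitted one, so the field can be set at creation time but never changed afterwards.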
Description of problem:
The install-config.yaml file lets a user set one server group policy for Control plane nodes and one for Compute nodes, choosing from affinity, soft-affinity, anti-affinity, and soft-anti-affinity. The installer then creates the server group if it doesn't exist. However, the server group policy defined in install-config for Compute nodes is ignored: the worker server group always has the same policy as the Control plane's.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. openshift-install create install-config
2. set Compute's serverGroupPolicy to soft-affinity in install-config.yaml
3. openshift-install create cluster
4. watch the server groups
Actual results:
both master and worker server groups have the default soft-anti-affinity policy
Expected results:
the worker server group should have soft-affinity as its policy
Additional info:
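The expected behaviour amounts to resolving the policy for each machine pool independently. The sketch below shows that resolution on a dict shaped like an install-config; it is not the installer's code. The dict layout (`controlPlane`, `compute`, `platform.openstack.serverGroupPolicy`) and both function names are assumptions made for illustration, and the default is the one named in the actual results above.

```python
DEFAULT_POLICY = "soft-anti-affinity"  # default named in the report above


def server_group_policy(machine_pool):
    """Return the policy for one machine pool, falling back to the default."""
    openstack = (machine_pool.get("platform") or {}).get("openstack") or {}
    return openstack.get("serverGroupPolicy", DEFAULT_POLICY)


def resolve_policies(install_config):
    """Resolve the control plane and compute policies independently."""
    compute_pools = install_config.get("compute") or [{}]
    return {
        "master": server_group_policy(install_config.get("controlPlane") or {}),
        "worker": server_group_policy(compute_pools[0]),
    }


# Assumed install-config fragment matching the reproduction steps.
install_config = {
    "controlPlane": {"name": "master", "platform": {"openstack": {}}},
    "compute": [
        {
            "name": "worker",
            "platform": {"openstack": {"serverGroupPolicy": "soft-affinity"}},
        }
    ],
}

print(resolve_policies(install_config))
```

The reported bug corresponds to the worker entry being computed from the control plane pool instead of from the compute pool.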
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/122
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
The e2e-gcp-op-layering CI job seems to be continuously and consistently failing during the teardown process. In particular, it appears to be the TestOnClusterBuildRollsOutImage test that is failing whenever it attempts to tear down the node. See: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/4060/pull-ci-openshift-machine-config-operator-master-e2e-gcp-op-layering/1744805949165539328 for an example of a failing job.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
Open a PR to the GitHub MCO repository.
Actual results:
The teardown portion of the TestOnClusterBuildRollsOutImage test fails thusly:
utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
utils.go:1098:
    Error Trace:
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
        /usr/lib/golang/src/testing/testing.go:1150
        /usr/lib/golang/src/testing/testing.go:1328
        /usr/lib/golang/src/testing/testing.go:1570
    Error: Received unexpected error: exit status 1
    Test: TestOnClusterBuildRollsOutImage
utils.go:1097: Deleting machine ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f / node ci-op-v5qcditr-46b3f-bh29c-worker-c-fcl9f
utils.go:1098:
    Error Trace:
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
        /usr/lib/golang/src/testing/testing.go:1150
        /usr/lib/golang/src/testing/testing.go:1328
        /usr/lib/golang/src/testing/testing.go:1312
        /usr/lib/golang/src/runtime/panic.go:522
        /usr/lib/golang/src/testing/testing.go:980
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:1098
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/onclusterbuild_test.go:103
        /go/src/github.com/openshift/machine-config-operator/test/e2e-layering/helpers_test.go:149
        /go/src/github.com/openshift/machine-config-operator/test/helpers/utils.go:79
        /usr/lib/golang/src/testing/testing.go:1150
        /usr/lib/golang/src/testing/testing.go:1328
        /usr/lib/golang/src/testing/testing.go:1570
    Error: Received unexpected error: exit status 1
    Test: TestOnClusterBuildRollsOutImage
Expected results:
This part of the test should pass.
Additional info:
The way the test teardown process currently works is that it shells out to the oc command to delete the underlying Machine and Node. We delete the underlying machine and node so that the cloud provider will provision us a new one due to issues with opting out of on-cluster builds that have yet to be resolved. At the time this test was written, it was implemented in this way to avoid having to vendor the Machine client and API into the MCO codebase which has since happened. I suspect the issue is that oc is failing in some way since we get an exit status 1 from where it is invoked. Now that the Machine client and API are vendored into the MCO codebase, it makes more sense for us to use those directly instead of shelling out to oc in order to do this since we would get more verbose error messages instead.
Description of problem:
The network operator is not compliant with the CIS benchmark rule "Ensure Usage of Unique Service Accounts" [1], which is part of the "ocp4-cis" profile used by the compliance operator [2]. The network operator runs with the default service account, which is what a pod gets when no other service account is specified. OpenShift core operators should be compliant with the CIS benchmark, i.e. the operators should run with their own service account rather than the "default" one. A similar bug was raised for the machine-config operator.
[1] https://static.open-scap.org/ssg-guides/ssg-ocp4-guide-cis.html#xccdf_org.ssgproject.content_group_accounts
[2] https://docs.openshift.com/container-platform/4.11/security/compliance_operator/compliance-operator-supported-profiles.html
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Network operator using default SA
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/machine-config-operator/pull/3919
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
oc-mirror panics when v2 is used to mirror from disk to registry
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Create imageset that we are using:
   cat config.yaml
   apiVersion: mirror.openshift.io/v1alpha2
   kind: ImageSetConfiguration
   mirror:
     platform:
       channels:
       - name: stable-4.13
         minVersion: 4.13.13
         maxVersion: 4.13.13
       graph: true
2. Mirror to disk by command: `oc-mirror --config config.yaml file://out --v2`
3. Mirror from disk to registry by command: `oc-mirror --config config.yaml --from out/working-dir/ docker://ec2-18-217-139-237.us-east-2.compute.amazonaws.com:5000/ocpv2 --v2`
Actual results:
oc-mirror --from out/working-dir/ docker://ec2-18-217-139-237.us-east-2.compute.amazonaws.com:5000/ocpv2 --v2
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
2023/11/06 03:10:19 [ERROR] : use the --config flag it is mandatory
[root@preserve-fedora36 1106]# oc-mirror --config config.yaml --from out/working-dir/ docker://ec2-18-217-139-237.us-east-2.compute.amazonaws.com:5000/ocpv2 --v2
--v2 flag identified, flow redirected to the oc-mirror v2 version. PLEASE DO NOT USE that. V2 is still under development and it is not ready to be used.
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
github.com/openshift/oc-mirror/v2/pkg/cli.(*ExecutorSchema).Complete(0xc000c28a80, {0xc00012cd20, 0x1, 0x0?})
/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/pkg/cli/executor.go:330 +0x1a18
github.com/openshift/oc-mirror/v2/pkg/cli.NewMirrorCmd.func1(0xc000005500?, {0xc00012cd20, 0x1, 0x6})
/go/src/github.com/openshift/oc-mirror/vendor/github.com/openshift/oc-mirror/v2/pkg/cli/executor.go:137 +0xfd
github.com/spf13/cobra.(*Command).execute(0xc000005500, {0xc000052080, 0x6, 0x6})
/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:944 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0xc000005500)
/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:1068 +0x3bd
github.com/spf13/cobra.(*Command).Execute(0x0?)
/go/src/github.com/openshift/oc-mirror/vendor/github.com/spf13/cobra/command.go:992 +0x19
main.main()
/go/src/github.com/openshift/oc-mirror/cmd/oc-mirror/main.go:10 +0x1e
Expected results:
No panic
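The stack trace above shows ExecutorSchema.Complete receiving an argument slice of length 1 and reading index 1. A defensive pattern for this class of bug is sketched below in Python for brevity. It is only an illustration and not the oc-mirror fix: the helper name `positional_arg` and the argument label `"second argument"` are hypothetical, and what the real code reads at that index is not stated in the report.

```python
# Hypothetical sketch: validate the number of positional arguments before
# indexing, so a missing argument yields an error message instead of a crash.
def positional_arg(args: list, index: int, name: str) -> str:
    """Return args[index], or raise a descriptive error if it is absent."""
    if index >= len(args):
        raise ValueError(
            f"missing required argument {name!r}: "
            f"got {len(args)} positional argument(s)"
        )
    return args[index]


# Only one positional argument was supplied in the reported invocation.
args = ["docker://registry.example:5000/ocpv2"]  # placeholder destination
try:
    positional_arg(args, 1, "second argument")
except ValueError as err:
    print(f"error: {err}")
```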
Additional info:
Description of problem:
On Alibaba, some volume snapshots never become ready.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-06-182702
How reproducible: sometimes
Steps to Reproduce:
Actual results:
$ oc get volumesnapshot
NAME          READYTOUSE   SOURCEPVC   ...
mysnapl587m   false        myclaim     ...
Expected results:
The VolumeSnapshot becomes ready in ~1 minute or less (for small volumes)
Additional info:
There seems to be something odd between the external-snapshotter and the CSI driver. From the snapshotter logs:
This sequence is very timing sensitive - sometimes it happens that the cloud finishes the snapshot at step 2., therefore the driver gets snapshot that is ready at step 3. and then everything works OK.
(Sorry, I lost the full logs...)
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/98
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-28539. The following is the description of the original issue:
—
Description of problem:
Pod capi-ibmcloud-controller-manager stuck in ContainerCreating on IBM cloud
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Build a cluster on IBM Cloud and enable TechPreviewNoUpgrade
Actual results:
4.16 cluster
$ oc get po
NAME                                                READY   STATUS              RESTARTS        AGE
capi-controller-manager-6bccdc844-jsm4s             1/1     Running             9 (24m ago)     175m
capi-ibmcloud-controller-manager-75d55bfd7d-6qfxh   0/2     ContainerCreating   0               175m
cluster-capi-operator-768c6bd965-5tjl5              1/1     Running             0               3h
Warning  FailedMount  5m15s (x87 over 166m)  kubelet  MountVolume.SetUp failed for volume "credentials" : secret "capi-ibmcloud-manager-bootstrap-credentials" not found
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-01-21-154905   True        False         156m    Cluster version is 4.16.0-0.nightly-2024-01-21-154905

4.15 cluster
$ oc get po
NAME                                                READY   STATUS              RESTARTS        AGE
capi-controller-manager-6b67f7cff4-vxtpg            1/1     Running             6 (9m51s ago)   35m
capi-ibmcloud-controller-manager-54887589c6-6plt2   0/2     ContainerCreating   0               35m
cluster-capi-operator-7b7f48d898-9r6nn              1/1     Running             1 (17m ago)     39m
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.0-0.nightly-2024-01-22-160236   True        False         11m     Cluster version is 4.15.0-0.nightly-2024-01-22-160236
Expected results:
No pod is in ContainerCreating status
Additional info:
must-gather: https://drive.google.com/file/d/1F5xUVtW-vGizAYgeys0V5MMjp03zkSEH/view?usp=sharing
Description: If tokenConfig.accessTokenInactivityTimeout is set to less than 300s, the accessTokenInactivityTimeout doesn't work in the hosted cluster, whereas in the Management cluster we get the error below when trying to set the timeout to less than 300s:
spec.tokenConfig.accessTokenInactivityTimeout: Invalid value: v1.Duration{Duration:100000000000}: the minimum acceptable token timeout value is 300 seconds*
Steps to reproduce the issue:
1. Install a fresh 4.15 hypershift cluster
2. Configure accessTokenInactivityTimeout as below:
   $ oc edit hc -n clusters
   ...
   spec:
     configuration:
       oauth:
         identityProviders:
         ...
         tokenConfig:
           accessTokenInactivityTimeout: 100s
   ...
3. Wait for the oauth pods to redeploy and check the oauth cm for updated accessTokenInactivityTimeout value:
   $ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-xxxxx
   ...
   tokenConfig:
     accessTokenInactivityTimeout: 1m40s
   ...
4. Login to guest cluster with testuser-1 and get the token
   $ oc login https://a889<...>:6443 -u testuser-1 -p xxxxxxx
   $ TOKEN=`oc whoami -t`
Actual result:
Wait for 100s and try login with the TOKEN
$ oc login --token="$TOKEN"
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://a889<...>:6443" as "testuser-1" using the token provided.
You don't have any projects. You can try to create a new project, by running
oc new-project <projectname>
Expected result:
1. Login fails if the user is not active within the accessTokenInactivityTimeout seconds.
2. In Management cluster, we get below error when trying to set the timeout to less than 300s :
spec.tokenConfig.accessTokenInactivityTimeout: Invalid value: v1.Duration{Duration:100000000000}: the minimum acceptable token timeout value is 300 seconds*
Implement the same in hosted cluster.
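The requested validation can be sketched as below. This is a minimal illustration in Python, not the HyperShift implementation: the function names are hypothetical, and the parser handles only whole-number `h`, `m`, and `s` components (such as `100s` or `1m40s` from the steps above), not the full Go duration syntax.

```python
import re

MIN_TIMEOUT_SECONDS = 300  # minimum quoted in the Management cluster error
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_COMPONENT = re.compile(r"(\d+)([hms])")


def parse_duration_seconds(text: str) -> int:
    """Parse a duration such as '1m40s' into seconds (h/m/s integers only)."""
    position = 0
    total = 0
    for match in _COMPONENT.finditer(text):
        if match.start() != position:
            raise ValueError(f"invalid duration: {text!r}")
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def validate_inactivity_timeout(text: str) -> int:
    """Reject timeouts below the minimum, mirroring the Management cluster."""
    seconds = parse_duration_seconds(text)
    if seconds < MIN_TIMEOUT_SECONDS:
        raise ValueError(
            f"the minimum acceptable token timeout value is "
            f"{MIN_TIMEOUT_SECONDS} seconds (got {seconds})"
        )
    return seconds


for candidate in ("100s", "1m40s", "5m"):
    try:
        print(candidate, "accepted:", validate_inactivity_timeout(candidate))
    except ValueError as err:
        print(candidate, "rejected:", err)
```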
This is a clone of issue OCPBUGS-25483. The following is the description of the original issue:
—
Description of problem:
A regression was identified when creating LoadBalancer services in ARO in new 4.14 clusters (handled for new installations in OCPBUGS-24191). The same regression has also been confirmed in ARO clusters upgraded to 4.14.
Version-Release number of selected component (if applicable):
4.14.z
How reproducible:
On any ARO cluster upgraded to 4.14.z
Steps to Reproduce:
1. Install an ARO cluster
2. Upgrade to 4.14 from fast channel
3. oc create svc loadbalancer test-lb -n default --tcp 80:8080
Actual results:
# External-IP stuck in Pending
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
test-lb   LoadBalancer   172.30.104.200   <pending>     80:30062/TCP   15m

# Errors in cloud-controller-manager being unable to map VM to nodes
$ oc logs -l infrastructure.openshift.io/cloud-controller-manager=Azure -n openshift-cloud-controller-manager
I1215 19:34:51.843715 1 azure_loadbalancer.go:1533] reconcileLoadBalancer for service(default/test-lb) - wantLb(true): started
I1215 19:34:51.844474 1 event.go:307] "Event occurred" object="default/test-lb" fieldPath="" kind="Service" apiVersion="v1" type="Normal" reason="EnsuringLoadBalancer" message="Ensuring load balancer"
I1215 19:34:52.253569 1 azure_loadbalancer_repo.go:73] LoadBalancerClient.List(aro-r5iks3dh) success
I1215 19:34:52.253632 1 azure_loadbalancer.go:1557] reconcileLoadBalancer for service(default/test-lb): lb(aro-r5iks3dh/mabad-test-74km6) wantLb(true) resolved load balancer name
I1215 19:34:52.528579 1 azure_vmssflex_cache.go:162] Could not find node () in the existing cache. Forcely freshing the cache to check again...
E1215 19:34:52.714678 1 azure_vmssflex.go:379] fs.GetNodeNameByIPConfigurationID(/subscriptions/fe16a035-e540-4ab7-80d9-373fa9a3d6ae/resourceGroups/aro-r5iks3dh/providers/Microsoft.Network/networkInterfaces/mabad-test-74km6-master0-nic/ipConfigurations/pipConfig) failed. Error: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
E1215 19:34:52.714888 1 azure_loadbalancer.go:126] reconcileLoadBalancer(default/test-lb) failed: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
I1215 19:34:52.714956 1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=0.871261893 request="services_ensure_loadbalancer" resource_group="aro-r5iks3dh" subscription_id="fe16a035-e540-4ab7-80d9-373fa9a3d6ae" source="default/test-lb" result_code="failed_ensure_loadbalancer"
E1215 19:34:52.715005 1 controller.go:291] error processing service default/test-lb (will retry): failed to ensure load balancer: failed to map VM Name to NodeName: VM Name mabad-test-74km6-master-0
Expected results:
# The LoadBalancer gets an External-IP assigned
$ oc get svc test-lb -n default
NAME      TYPE           CLUSTER-IP       EXTERNAL-IP      PORT(S)        AGE
test-lb   LoadBalancer   172.30.193.159   20.242.180.199   80:31475/TCP   14s
Additional info:
In the cloud-provider-config cm in the openshift-config namespace, vmType="". When vmType is explicitly changed to "standard", the provisioning of the LoadBalancer completes and an ExternalIP gets assigned without errors.
Description of problem:
The current version of openshift/coredns vendors Kubernetes 1.26 packages. OpenShift 4.14 is based on Kubernetes 1.27.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Check https://github.com/openshift/coredns/blob/release-4.14/go.mod
Actual results:
Kubernetes packages (k8s.io/api, k8s.io/apimachinery, and k8s.io/client-go) are at version v0.26
Expected results:
Kubernetes packages are at version v0.27.0 or later.
Additional info:
Using old Kubernetes API and client packages brings risk of API compatibility issues.
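The check in the reproduction step can be automated along the following lines. This is a minimal sketch, assuming a go.mod with plain `module version` requirement lines; the function name and the sample go.mod content (including its patch versions) are hypothetical and are not taken from the openshift/coredns repository.

```python
import re

# The three packages named in the actual results above.
WATCHED_MODULES = ("k8s.io/api", "k8s.io/apimachinery", "k8s.io/client-go")
_REQUIREMENT = re.compile(r"^\s*(?:require\s+)?(\S+)\s+v(\d+)\.(\d+)\.(\d+)")


def outdated_k8s_modules(go_mod_text: str, required_minor: int = 27) -> dict:
    """Return watched modules whose version is older than v0.<required_minor>."""
    outdated = {}
    for line in go_mod_text.splitlines():
        match = _REQUIREMENT.match(line)
        if not match:
            continue
        module = match.group(1)
        major, minor, patch = (int(match.group(i)) for i in (2, 3, 4))
        if module in WATCHED_MODULES and (major, minor) < (0, required_minor):
            outdated[module] = f"v{major}.{minor}.{patch}"
    return outdated


# Hypothetical go.mod excerpt used only to exercise the function.
SAMPLE_GO_MOD = """
module github.com/example/project

require (
	k8s.io/api v0.26.1
	k8s.io/apimachinery v0.27.2
	k8s.io/client-go v0.26.1
)
"""

print(outdated_k8s_modules(SAMPLE_GO_MOD))
```

An empty result would correspond to the expected outcome of all three packages being at v0.27.0 or later.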
Description of problem:
The nodeip-configuration service does not log to the serial console, which makes it difficult to debug problems when networking is not available and there is no access to the node.
Version-Release number of selected component (if applicable):
Reported against 4.13, but present in all releases
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-25448. The following is the description of the original issue:
—
Description of problem:
When upgrading OCP 4.14.6 to 4.15.0-0.nightly-2023-12-13-032512, olm-operator pod always restarts, which blocks the cluster upgrading.
MacBook-Pro:~ jianzhang$ omg get clusterversion
2023-12-15 16:24:34.977 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.6    True        True          4h47m   Working towards 4.15.0-0.nightly-2023-12-13-032512: 701 of 873 done (80% complete), waiting on operator-lifecycle-manager

MacBook-Pro:~ jianzhang$ omg get pods
2023-12-15 16:47:36.383 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
NAME                                      READY   STATUS      RESTARTS   AGE
catalog-operator-564b666f96-6nmq8         1/1     Running     1          1h59m
collect-profiles-28375140-n9f2p           0/1     Succeeded   0          42m
collect-profiles-28375155-sf2qj           0/1     Succeeded   0          27m
collect-profiles-28375170-xkbxf           0/1     Succeeded   0          12m
olm-operator-6bfd5f76bc-xb5lk             0/1     Running     27         1h59m
package-server-manager-5b7969559f-68nn7   2/2     Running     0          1h59m
packageserver-5ffcb95bff-fvvpx            1/1     Running     0          1h58m
packageserver-5ffcb95bff-hgvxt            1/1     Running     0          1h58m

MacBook-Pro:~ jianzhang$ omg logs olm-operator-6bfd5f76bc-xb5lk --previous
2023-12-15 16:23:02.300 | WARNING | omg.utils.load_yaml:<module>:10 - yaml.CSafeLoader failed to load, using SafeLoader
2023-12-13T23:38:05.452697228Z time="2023-12-13T23:38:05Z" level=info msg="log level info"
2023-12-13T23:38:05.452950096Z time="2023-12-13T23:38:05Z" level=info msg="TLS keys set, using https for metrics"
2023-12-13T23:38:05.515929950Z time="2023-12-13T23:38:05Z" level=info msg="found nonconforming items" gvr="rbac.authorization.k8s.io/v1, Resource=rolebindings" nonconforming=1
2023-12-13T23:38:05.588194624Z time="2023-12-13T23:38:05Z" level=info msg="found nonconforming items" gvr="/v1, Resource=services" nonconforming=1
2023-12-13T23:38:06.116654658Z time="2023-12-13T23:38:06Z" level=info msg="detected ability to filter informers" canFilter=false
2023-12-13T23:38:06.118496116Z time="2023-12-13T23:38:06Z" level=info msg="registering labeller" gvr="apps/v1, Resource=deployments" index=0
... ...
2023-12-13T23:38:06.381370939Z time="2023-12-13T23:38:06Z" level=info msg="labeller complete" gvr="rbac.authorization.k8s.io/v1, Resource=clusterrolebindings" index=0
2023-12-13T23:38:06.381424190Z time="2023-12-13T23:38:06Z" level=info msg="starting clusteroperator monitor loop" monitor=clusteroperator
2023-12-13T23:38:06.381467749Z time="2023-12-13T23:38:06Z" level=info msg="detected that every object is labelled, exiting to re-start the process..."
Version-Release number of selected component (if applicable):
MacBook-Pro:~ jianzhang$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.15.0-0.nightly-2023-12-13-032512 |grep olm
operator-lifecycle-manager   https://github.com/openshift/operator-framework-olm   b4d2b70c34e9654afe30cf724f1dc85a1ce5c683
operator-registry            https://github.com/openshift/operator-framework-olm   b4d2b70c34e9654afe30cf724f1dc85a1ce5c683
How reproducible:
always
Steps to Reproduce:
1. Rerun this prow job: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-4.15-upgrade-from-stable-4.14-ibmcloud-ipi-f28/
Actual results:
Cluster failed to upgrade due to olm pods crash.
Expected results:
Cluster upgraded successfully.
Additional info:
Nothing uses these plugins in the ovnk image, and having them complicates security checking, which needs to use a different path to check RPMs instead of things built directly in the dockerfile.
Since they're unused, just remove them.
Description of problem:
8.1
  tagged from docker.io/openshift/wildfly-81-centos7:latest
    prefer registry pullthrough when referencing this tag

  Build and run WildFly 8.1 applications on CentOS 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/openshift-s2i/s2i-wildfly/blob/master/README.md.
  Tags: builder, wildfly, java
  Supports: wildfly:8.1, jee, java
  Example Repo: https://github.com/openshift/openshift-jee-sample.git

  ! error: Import failed (Unauthorized): you may not have access to the container image "docker.io/openshift/wildfly-81-centos7:latest"
      20 minutes ago

error: imported completed with errors
[Mon Oct 23 15:23:32 UTC 2023] Retrying image import openshift/wildfly:10.1
error: tag latest failed: you may not have access to the container image "docker.io/openshift/wildfly-101-centos7:latest"
imagestream.image.openshift.io/wildfly imported with errors

Name: wildfly
Namespace: openshift
Created: 21 minutes ago
Version-Release number of selected component (if applicable):
4.14 / 4.15
How reproducible:
Often on vSphere jobs, perhaps because they lack a local mirror?
Steps to Reproduce:
1. 2. 3.
Actual results:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/44127/rehearse-44127-periodic-ci-openshift-release-master-okd-scos-4.14-e2e-aws-ovn-serial/1716463869561409536
Expected results:
ci jobs run successfully
Additional info:
Description of problem:
When CPUPartitioning is not set in install-config.yaml, a warning message is still generated:
WARNING CPUPartitioning: is ignored
This warning is incorrect, since the check is against "None" and the value is an empty string when not set. It is also no longer relevant now that https://issues.redhat.com//browse/OCPBUGS-18876 has been fixed.
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Create an install config with CPUPartitioning not set
2. Run "openshift-install agent create image --dir cluster-manifests/ --log-level debug"
Actual results:
See the output "WARNING CPUPartitioning: is ignored"
Expected results:
No warning
Additional info:
Description of problem:
ConsoleExternalLogLink CRD.ConsoleExternalLogLink CRD creates, displays, modifies, and deletes a new ConsoleExternalLogLink instance AssertionError: Timed out retrying after 30000ms: Expected to find element: `[data-test-id=test-nubya-cell]`, but never found it.
Description of problem:
OLM is supposed to verify that an update to a CRD does not introduce validation that is more restrictive than what is currently in effect. The logic for this only works if a CRD uses a single spec.validation entry, but this is unlikely to ever be the case. Instead, most CRDs use per-version validation schemas.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create an operator that has a CRD with an entry in spec.versions along with spec.versions[].schema populated with some validation schema.
2. Create a CR
3. Attempt to upgrade to a newer version of the operator, where the CRD is updated to add a new version whose schema validation is more restrictive and will fail against the CR that was previously created
Actual results:
Upgrade succeeds
Expected results:
Upgrade fails
Additional info:
Description of problem:
- After upgrading to 4.13.30 (from 4.13.24), traffic from HostNetworked pods (router-default) to backends intermittently times out or fails to reach its destination. This was observed on all nodes and projects, and replicated on two clusters that underwent the same upgrade.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-openshift-ingress
  namespace: testing
spec:
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          policy-group.network.openshift.io/ingress: ""
  podSelector: {}
  policyTypes:
  - Ingress
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Upgrade cluster to 4.13.30
2. Apply test pod running basic HTTP instance at random port
3. Apply networkpolicy to allow-from-ingress and begin curl loop against target pod directly from ingressnode (or other worker node) at host chroot level (nodeIP).
4. Observe that curls time out intermittently. The curl loop used to replicate this is below (note the --connect-timeout flag, which lets the loop continue more rapidly without waiting for the full 2m connect timeout on a typical SYN failure).
$ while true; do curl --connect-timeout 5 --noproxy '*' -k -w "dnslookup: %{time_namelookup} | connect: %{time_connect} | appconnect: %{time_appconnect} | pretransfer: %{time_pretransfer} | starttransfer: %{time_starttransfer} | total: %{time_total} | size: %{size_download} | response: %{response_code}\n" -o /dev/null -s https://<POD>:<PORT>; done
Actual results:
- Traffic to all backends is dropped/degraded as a result of this intermittent failure marking valid/healthy pods as unavailable due to the connection failure to the backends.
Expected results:
- Traffic should not be impeded, especially when a networkpolicy that allows said traffic has been applied.
Additional info:
RCA UPDATE:
So the problem is that host-network namespace is not labeled by ingress controller and if router pods are hostNetworked, network policy with `policy-group.network.openshift.io/ingress: ""` selector won't allow incoming connections. To reproduce, we need to run ingress controller with `EndpointPublishingStrategy=HostNetwork` https://docs.openshift.com/container-platform/4.14/networking/nw-ingress-controller-endpoint-publishing-strategies.html and then check host-network namespace labels with
oc get ns openshift-host-network --show-labels
# expected this
kubernetes.io/metadata.name=openshift-host-network,network.openshift.io/policy-group=ingress,policy-group.network.openshift.io/host-network=,policy-group.network.openshift.io/ingress=
# but before the fix you will see
kubernetes.io/metadata.name=openshift-host-network,policy-group.network.openshift.io/host-network=
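The comparison between the two label sets above can be scripted. The following is a minimal sketch, assuming the comma-separated `key=value` form printed by `--show-labels`; the function names are hypothetical, and the required labels are the two that appear in the expected output but not in the pre-fix output.

```python
# Labels present in the expected output above and absent before the fix.
REQUIRED_LABELS = {
    "network.openshift.io/policy-group": "ingress",
    "policy-group.network.openshift.io/ingress": "",
}


def parse_labels(text: str) -> dict:
    """Parse a comma-separated key=value label string into a dict."""
    labels = {}
    for item in text.strip().split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        labels[key] = value
    return labels


def missing_ingress_labels(text: str) -> list:
    """Return the required labels that are absent or have another value."""
    labels = parse_labels(text)
    return sorted(
        key for key, value in REQUIRED_LABELS.items() if labels.get(key) != value
    )


expected = (
    "kubernetes.io/metadata.name=openshift-host-network,"
    "network.openshift.io/policy-group=ingress,"
    "policy-group.network.openshift.io/host-network=,"
    "policy-group.network.openshift.io/ingress="
)
before_fix = (
    "kubernetes.io/metadata.name=openshift-host-network,"
    "policy-group.network.openshift.io/host-network="
)

print(missing_ingress_labels(expected))
print(missing_ingress_labels(before_fix))
```

An empty list means the namespace carries the labels that the `policy-group.network.openshift.io/ingress: ""` selector needs.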
Another way to verify this is the same problem (disruptive, only recommended for test environments) is to make CNO unmanaged
oc scale deployment cluster-version-operator -n openshift-cluster-version --replicas=0
oc scale deployment network-operator -n openshift-network-operator --replicas=0
and then label openshift-host-network namespace manually based on expected labels ^ and see if the problem disappears
Potentially affected versions (may need to reproduce to confirm)
4.16.0, 4.15.0, 4.14.0 since https://issues.redhat.com//browse/OCPBUGS-8070
4.13.30 https://issues.redhat.com/browse/OCPBUGS-22293
4.12.48 https://issues.redhat.com/browse/OCPBUGS-24039
Mitigation/support KCS:
https://access.redhat.com/solutions/7055050
Description of problem:
Currently the console frontend and backend use the OpenShift-centric UserKind type. For the console to work without the OAuth server, in other words with external OIDC, it needs to use the k8s UserInfo type, which is retrieved by querying the SelfSubjectReview API.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Console is not working with external OIDC provider
Expected results:
Console will be working with external OIDC provider
Additional info:
This is mainly an API change.
Description of problem:
Due to the way that the termination handler unit tests are configured, the counter of HTTP requests to the mock handler can in some cases cause the test to deadlock and time out. This happens randomly because the ordering of the tests affects when the bug occurs.
Version-Release number of selected component (if applicable):
4.13+
How reproducible:
It happens randomly when run in CI, or when the full suite is run. If the tests are focused, it happens every time. Focusing on "poll URL cannot be reached" triggers the bug.
Steps to Reproduce:
1. add `-focus "poll URL cannot be reached"` to unit test ginkgo arguments
2. run `make unit`
Actual results:
test suite hangs after this output: "Handler Suite when running the handler when polling the termination endpoint and the poll URL cannot be reached should return an error /home/mike/dev/machine-api-provider-aws/pkg/termination/handler_test.go:197"
Expected results:
Tests pass
Additional info:
To fix this we need to isolate the test in its own context block. This patch should do the trick:
diff --git a/pkg/termination/handler_test.go b/pkg/termination/handler_test.go
index 2b98b08b..0f85feae 100644
--- a/pkg/termination/handler_test.go
+++ b/pkg/termination/handler_test.go
@@ -187,7 +187,9 @@ var _ = Describe("Handler Suite", func() {
                 Consistently(nodeMarkedForDeletion(testNode.Name)).Should(BeFalse())
             })
         })
+    })
 
+    Context("when the termination endpoint is not valid", func() {
         Context("and the poll URL cannot be reached", func() {
             BeforeEach(func() {
                 nonReachable := "abc#1://localhost"
Issue 50 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
Add page dropdown doesn't break anymore and overlays if the window is too small.
Screenshots:
Please review the following PR: https://github.com/openshift/machine-api-provider-openstack/pull/99
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
MGMT-11443 added an API for users to download the rendered nmconnection files used in the ISO, but when using the kube-api that URL isn't given to the user.
This should be added to the InfraEnv status in the debug info section.
Please review the following PR: https://github.com/openshift/image-registry/pull/379
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In the python script used during bug pre-dispatch, include "networking / network-tools" component.
Please review the following PR: https://github.com/openshift/machine-api-provider-gcp/pull/73
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-28835. The following is the description of the original issue:
—
Description of problem:
NHC failed to watch Metal3 remediation template
Version-Release number of selected component (if applicable):
OCP4.13 and higher
How reproducible:
100%
Steps to Reproduce:
1. Create Metal3RemediationTemplate
2. Install NHC v0.7.0
3. Create NHC with Metal3RemediationTemplate
Actual results:
E0131 14:07:51.603803 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource "metal3remediationtemplates" in API group "infrastructure.cluster.x-k8s.io" at the cluster scope
E0131 14:07:59.912283 1 reflector.go:147] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: Failed to watch infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3Remediation: unknown
W0131 14:08:24.831958 1 reflector.go:539] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: failed to list infrastructure.cluster.x-k8s.io/v1beta1, Kind=Metal3RemediationTemplate: metal3remediationtemplates.infrastructure.cluster.x-k8s.io is forbidden: User "system:serviceaccount:openshift-workload-availability:node-healthcheck-controller-manager" cannot list resource
Expected results:
No errors
Additional info:
Description of problem:
The customer has a custom apiserver certificate.
This error appears when trying to uninstall any operator from the console:
openshift-console/pods/console-56494b7977-d7r76/console/console/logs/current.log:
2023-10-24T14:13:21.797447921+07:00 E1024 07:13:21.797400 1 operands_handler.go:67] Failed to get new client for listing operands: Get "https://api.<cluster>.<domain>:6443/api?timeout=32s": x509: certificate signed by unknown authority
When we try the same request from the console pod, we see no issue.
The root CA that signs the apiserver certificate is trusted in the pod.
The code that seems to provoke this issue is:
https://github.com/openshift/console/blob/master/pkg/server/operands_handler.go#L62-L70
Description of problem:
On a Hypershift (guest) cluster, the EFS driver pod is stuck in the ContainerCreating state
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
Always
Steps to Reproduce:
1. Create Hypershift cluster. Flexy template: aos-4_14/ipi-on-aws/versioned-installer-ovn-hypershift-ci
2. Try to install EFS operator and driver from yaml file/web console as mentioned in below steps.
a) Create iam role from ccoctl tool and will get ROLE ARN value from the output
b) Install EFS operator using the above ROLE ARN value.
c) Check EFS operator, node, controller pods are up and running
// og-sub-hcp.yaml
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  generateName: openshift-cluster-csi-drivers-
  namespace: openshift-cluster-csi-drivers
spec:
  namespaces:
  - ""
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: aws-efs-csi-driver-operator
  namespace: openshift-cluster-csi-drivers
spec:
  channel: stable
  name: aws-efs-csi-driver-operator
  source: qe-app-registry
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: ROLEARN
      value: arn:aws:iam::301721915996:role/hypershift-ci-16666-openshift-cluster-csi-drivers-aws-efs-cloud-
// driver.yaml
apiVersion: operator.openshift.io/v1
kind: ClusterCSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  logLevel: TraceAll
  managementState: Managed
  operatorLogLevel: TraceAll
Actual results:
aws-efs-csi-driver-controller-699664644f-dkfdk 0/4 ContainerCreating 0 87m
Expected results:
EFS controller pods should be up and running
Additional info:
oc -n openshift-cluster-csi-drivers logs aws-efs-csi-driver-operator-6758c5dc46-b75hb
E0821 08:51:25.160599 1 base_controller.go:266] "AWSEFSDriverCredentialsRequestController" controller failed to sync "key", err: cloudcredential.operator.openshift.io "cluster" not found
Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1692606247221239
Installation steps epic: https://issues.redhat.com/browse/STOR-1421
Please review the following PR: https://github.com/openshift/machine-api-operator/pull/1187
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When a cluster is using FIPS in an installation with the agent installer, the reboot in the machine-config-daemon-firstboot.service is not skipped. Since https://issues.redhat.com/browse/MCO-706 the agent installer should be able to skip the firstboot service reboot.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. We use these prow jobs to install a cluster:
without fips (HA): periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-baremetal-pxe-ha-agent-ipv4-static-connected-f14
with fips (SNO): periodic-ci-openshift-openshift-tests-private-release-4.15-amd64-nightly-baremetal-sno-agent-ipv4-static-connected-f7
We can find the firstboot service's logs in the must-gather.tar file.
Actual results:
In the machine-config-daemon-firstboot.service logs we can see that the reboot is not skipped when the installation is using fips=true. You can find the logs in the "additional info" section below.
Expected results:
The firstboot service should skip the reboot in the installation.
Additional info:
These are the machine-config-daemon-firstboot logs for a baremetal HA cluster with fips, installed using the agent installer (FIRST REBOOT NOT SKIPPED):
Nov 14 11:26:59 worker-00 systemd[1]: Starting Machine Config Daemon Firstboot...
Nov 14 11:26:59 worker-00 sh[4182]: sed: can't read /etc/yum.repos.d/*.repo: No such file or directory
Nov 14 11:26:59 worker-00 podman[4183]: W1114 11:26:59.393738 1 daemon.go:1673] Failed to persist NIC names: open /rootfs/etc/systemd/network: no such file or directory
Nov 14 11:26:59 worker-00 podman[4296]: I1114 11:26:59.866300 4348 daemon.go:457] container is rhel8, target is rhel9
Nov 14 11:26:59 worker-00 podman[4296]: I1114 11:26:59.896550 4348 daemon.go:525] Invoking re-exec /run/bin/machine-config-daemon
Nov 14 11:26:59 worker-00 podman[4296]: I1114 11:26:59.955660 4348 update.go:2120] Running: systemctl daemon-reload
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.537582 4348 rpm-ostree.go:88] Enabled workaround for bug 2111817
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.537944 4348 rpm-ostree.go:263] Linking ostree authfile to /etc/mco/internal-registry-pull-secret.json
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.833062 4348 daemon.go:270] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a9bdfdf95023b7aebbbc9d5d335c973832fceb795ed943f365fefea7db646b66 (415.92.202311130854-0) 67df227c04e9306ddcb78331654ecf0ebb2cb1433498f9c12e832c7d5e74c1d9
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.833303 4348 rpm-ostree.go:308] Running captured: rpm-ostree --version
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.893156 4348 daemon.go:1076] rpm-ostree has container feature
Nov 14 11:27:00 worker-00 podman[4296]: I1114 11:27:00.893582 4348 rpm-ostree.go:308] Running captured: rpm-ostree kargs
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.008588 4348 update.go:2157] Adding SIGTERM protection
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.008821 4348 update.go:599] Checking Reconcilable for config mco-empty-mc to rendered-worker-ef30fce69107b4fc38dc1020038ebd6a
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.009121 4348 update.go:1064] FIPS is configured and enabled
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.009345 4348 update.go:2135] Starting update from mco-empty-mc to rendered-worker-ef30fce69107b4fc38dc1020038ebd6a: &{osUpdate:true kargs:true fips:false passwd:false files:false units:false kernelType:false extensions:false}
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055403 4348 update.go:1349] Updating files
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055415 4348 update.go:1412] Deleting stale data
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055419 4348 update.go:1818] updating the permission of the kubeconfig to: 0o600
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055484 4348 update.go:1784] Checking if absent users need to be disconfigured
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055610 4348 update.go:2210] Already in desired image quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:a9bdfdf95023b7aebbbc9d5d335c973832fceb795ed943f365fefea7db646b66
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.055616 4348 update.go:2120] Running: rpm-ostree cleanup -p
Nov 14 11:27:01 worker-00 podman[4296]: Deployments unchanged.
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.224788 4348 update.go:2135] Running rpm-ostree [kargs --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=1]
Nov 14 11:27:01 worker-00 podman[4296]: I1114 11:27:01.271647 4348 update.go:2120] Running: rpm-ostree kargs --append=systemd.unified_cgroup_hierarchy=1 --append=cgroup_no_v1="all" --append=psi=1
Nov 14 11:27:03 worker-00 podman[4296]: Staging deployment...done
Nov 14 11:27:05 worker-00 podman[4296]: Changes queued for next boot. Run "systemctl reboot" to start a reboot
Nov 14 11:27:05 worker-00 podman[4296]: I1114 11:27:05.081854 4348 update.go:2135] Rebooting node
Nov 14 11:27:05 worker-00 podman[4296]: I1114 11:27:05.127794 4348 update.go:2165] Removing SIGTERM protection
Nov 14 11:27:05 worker-00 podman[4296]: I1114 11:27:05.127853 4348 update.go:2135] initiating reboot: Completing firstboot provisioning to rendered-worker-ef30fce69107b4fc38dc1020038ebd6a
Nov 14 11:27:05 worker-00 podman[4296]: I1114 11:27:05.235062 4348 update.go:2135] reboot successful
Nov 14 11:27:05 worker-00 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=killed, status=15/TERM
Nov 14 11:27:05 worker-00 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'signal'.
Nov 14 11:27:05 worker-00 systemd[1]: Stopped Machine Config Daemon Firstboot.
-- Boot 2f510f83bdb047bb921fc429d67b8e6a --
These are the logs for a baremetal HA cluster without fips, installed using the agent installer (FIRST REBOOT SKIPPED):
Nov 08 14:27:30 worker-00 systemd[1]: Starting Machine Config Daemon Firstboot...
Nov 08 14:27:30 worker-00 sh[4171]: sed: can't read /etc/yum.repos.d/*.repo: No such file or directory
Nov 08 14:27:30 worker-00 podman[4172]: W1108 14:27:30.970986 1 daemon.go:1673] Failed to persist NIC names: open /rootfs/etc/systemd/network: no such file or directory
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.172975 4320 daemon.go:457] container is rhel8, target is rhel9
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.202238 4320 daemon.go:525] Invoking re-exec /run/bin/machine-config-daemon
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.237492 4320 update.go:2120] Running: systemctl daemon-reload
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.436217 4320 rpm-ostree.go:88] Enabled workaround for bug 2111817
Nov 08 14:27:31 worker-00 podman[4273]: E1108 14:27:31.436346 4320 rpm-ostree.go:285] Merged secret file could not be validated; defaulting to cluster pull secret <nil>
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.436375 4320 rpm-ostree.go:263] Linking ostree authfile to /var/lib/kubelet/config.json
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.555415 4320 daemon.go:270] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e03c9248f78a107efb8b12430d46304e8d93981d23fd932e159d518ed675bc92 (415.92.202311061558-0) b8e1dca18619a2e497edf5346d5018615a226da380989ef6720a1a8cdc27adeb
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.555920 4320 rpm-ostree.go:308] Running captured: rpm-ostree --version
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.571985 4320 daemon.go:1076] rpm-ostree has container feature
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.572484 4320 rpm-ostree.go:308] Running captured: rpm-ostree kargs
Nov 08 14:27:31 worker-00 podman[4273]: I1108 14:27:31.600313 4320 update.go:186] No changes from mco-empty-mc to rendered-worker-30da1eef7a5d361fc395f2726c8210d5
Nov 08 14:27:31 worker-00 systemd[1]: Finished Machine Config Daemon Firstboot.
This is a clone of issue OCPBUGS-28661. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-28664. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
The installer requires the `s3:HeadBucket` permission even though no such permission exists. The correct permission for the `HeadBucket` action is `s3:ListBucket`: https://docs.aws.amazon.com/AmazonS3/latest/API/API_HeadBucket.html
Version-Release number of selected component (if applicable):
4.16
How reproducible:
always
Steps to Reproduce:
1. Install a cluster using a role with limited permissions 2. 3.
Actual results:
level=warning msg=Action not allowed with tested creds action=iam:DeleteUserPolicy
level=warning msg=Tested creds not able to perform all requested actions
level=warning msg=Action not allowed with tested creds action=s3:HeadBucket
level=warning msg=Tested creds not able to perform all requested actions
level=fatal msg=failed to fetch Cluster: failed to fetch dependency of "Cluster": failed to generate asset "Platform Permissions Check": validate AWS credentials: AWS credentials cannot be used to either create new creds or use as-is
Installer exit with code 1
Expected results:
Installer should check only for s3:ListBucket
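As an illustration of the expected behaviour, the following minimal sketch rewrites a list of required IAM actions so that the non-existent `s3:HeadBucket` is replaced by `s3:ListBucket`. The function name and the replacement table are hypothetical and are not taken from the installer's code; only the HeadBucket-to-ListBucket mapping comes from the report above.

```python
# Hypothetical table: action names that are not real IAM permissions,
# mapped to the permission that actually authorizes the API call.
INVALID_ACTION_REPLACEMENTS = {
    "s3:HeadBucket": "s3:ListBucket",
}


def normalize_required_actions(actions):
    """Replace invalid action names and drop duplicates, keeping order."""
    normalized = []
    for action in actions:
        action = INVALID_ACTION_REPLACEMENTS.get(action, action)
        if action not in normalized:
            normalized.append(action)
    return normalized


if __name__ == "__main__":
    required = ["iam:DeleteUserPolicy", "s3:HeadBucket", "s3:ListBucket"]
    print(normalize_required_actions(required))
```

With a list like this, a permissions check would only ever ask for `s3:ListBucket`, which is what the report requests.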
Additional info:
Description of problem:
The script refactoring from https://github.com/openshift/cluster-etcd-operator/pull/1057 introduced a regression. Since the static pod list variable was renamed, it is now empty and won't restore the non-etcd pod yamls anymore.
Version-Release number of selected component (if applicable):
4.14 and later
How reproducible:
always
Steps to Reproduce:
1. create a cluster
2. restore using cluster-restore.sh
Actual results:
The apiserver and other static pods are not immediately restored. The script only outputs this log:
removing previous backup /var/lib/etcd-backup/member
Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
starting restore-etcd static pod
Expected results:
The non-etcd static pods should be immediately restored by moving them into the manifest directory again. You can see this by the log output:
Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
starting restore-etcd static pod
starting kube-apiserver-pod.yaml
static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
starting kube-controller-manager-pod.yaml
static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml
starting kube-scheduler-pod.yaml
static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
Additional info:
Description of problem:
monitoring-plugin cannot be started on an IPv6-disabled cluster because the pod listens on [::]:9443. Monitoring-plugin should listen on [::]:9443 on an IPv6-enabled cluster and on 0.0.0.0:9443 on an IPv6-disabled cluster.
$ oc logs monitoring-plugin-dc84478c-5rwmm
2023/10/14 13:42:41 [emerg] 1#0: socket() [::]:9443 failed (97: Address family not supported by protocol)
nginx: [emerg] socket() [::]:9443 failed (97: Address family not supported
Version-Release number of selected component (if applicable):
4.14.0-rc.5
How reproducible:
Always
Steps to Reproduce:
1) disable ipv6 following https://access.redhat.com/solutions/5513111
cat <<EOF |oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-openshift-machineconfig-master-kargs
spec:
  kernelArguments:
    - ipv6.disable=1
EOF
cat <<EOF |oc create -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-openshift-machineconfig-worker-kargs
spec:
  kernelArguments:
    - ipv6.disable=1
EOF
2) Check the mcp status
3) Check the monitoring plugin pod status
Actual results:
1) The mcp is pending because the monitoring-plugin pod cannot be scheduled
$ oc get mcp |grep worker.
worker rendered-worker-ba1d1b8306f65bc5ff53b0c05a54143f False True False 5 3 3 0 3h59m
$ oc logs machine-config-controller-5b96788c69-j9d7k
I1014 13:05:57.767217 1 drain_controller.go:350] Previous node drain found. Drain has been going on for 0.025260005567777778 hours
I1014 13:05:57.767228 1 drain_controller.go:173] node anlim14-c6jbb-worker-b-rgqq5.c.openshift-qe.internal: initiating drain
E1014 13:05:58.411241 1 drain_controller.go:144] WARNING: ignoring DaemonSet-managed ……
I1014 13:05:58.413116 1 drain_controller.go:144] evicting pod openshift-monitoring/monitoring-plugin-dc84478c-92xr4
E1014 13:05:58.422164 1 drain_controller.go:144] error when evicting pods/"monitoring-plugin-dc84478c-92xr4" -n "openshift-monitoring" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
I1014 13:06:03.422338 1 drain_controller.go:144] evicting pod openshift-monitoring/monitoring-plugin-dc84478c-92xr4
E1014 13:06:03.433295 1 drain_controller.go:144] error when evicting pods/"monitoring-plugin-dc84478c-92xr4" -n "openshift-monitoring" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
2) The monitoring-plugin pod listens on [::], which is an invalid address on an IPv6-disabled cluster.
$ oc extract cm/monitoring-plugin
$ cat nginx.conf
error_log /dev/stdout info;
events {}
http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  keepalive_timeout 65;
  server {
    listen 9443 ssl;
    listen [::]:9443 ssl;
    ssl_certificate /var/cert/tls.crt;
    ssl_certificate_key /var/cert/tls.key;
    root /usr/share/nginx/html;
  }
}
Expected results:
Monitoring-plugin listens on [::]:9443 on IPv6 enabled cluster
Monitoring-plugin listens on 0.0.0.0:9443 on IPv6 disabled cluster.
Additional info:
The PR showing how logging fixed this issue: https://github.com/openshift/cluster-logging-operator/pull/2207/files#diff-dc6205a02c6c783e022ae0d4c726327bee4ef34cd1361541d1e3165ee7056b38R43
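The expected behaviour can be sketched as follows: probe whether the host can open an IPv6 socket and emit the `[::]` listen directive only when it can. This is a hedged illustration, not the monitoring-plugin's or the logging operator's actual implementation; the function names are hypothetical.

```python
import socket


def ipv6_available():
    """Return True if an IPv6 TCP socket can be created and bound.

    With the ipv6.disable=1 kernel argument, socket() itself fails with
    "Address family not supported by protocol", which lands in the
    OSError branch below.
    """
    if not socket.has_ipv6:
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            sock.bind(("::", 0))  # port 0: let the kernel pick a free port
        return True
    except OSError:
        return False


def listen_directives(port, ipv6):
    """Build nginx listen lines; the IPv6 line is added only if usable."""
    lines = ["listen {} ssl;".format(port)]
    if ipv6:
        lines.append("listen [::]:{} ssl;".format(port))
    return lines


if __name__ == "__main__":
    for line in listen_directives(9443, ipv6_available()):
        print(line)
```

The design choice here is to decide at configuration-render time rather than at nginx start-up, so the rendered file never contains an address family the node cannot serve.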
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource/pull/157
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Code calls secrets instead of configmaps
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The code https://github.com/openshift/cluster-monitoring-operator/blob/91d735bd8662965037aae60c846c53baa79752ac/pkg/tasks/controlplane.go#L79-L93 makes sure CMO deletes the resources it used to manage.
The code was temporarily added in https://github.com/openshift/cluster-monitoring-operator/pull/2039/files
Description of problem:
In a 4.14 cluster, I'm seeing CVO hotloops on ClusterRoleBinding cluster-baremetal-operator and ConfigMap openshift-machine-config-operator/kube-rbac-proxy with empty ManagedFields.
# oc logs cluster-version-operator-7cf78c4f65-hfh7f -n openshift-cluster-version | grep -o 'Updating .*due to diff'| sort | uniq -c
93 Updating ClusterRoleBinding cluster-baremetal-operator due to diff
93 Updating ClusterRole machine-api-operator-ext-remediation due to diff
93 Updating ConfigMap openshift-machine-config-operator/kube-rbac-proxy due to diff
CVO logs the diff as below:
I0919 10:19:24.658975 1 rbac.go:38] Updating ClusterRoleBinding cluster-baremetal-operator due to diff: &v1.ClusterRoleBinding{
  TypeMeta: v1.TypeMeta{
-   Kind: "",
+   Kind: "ClusterRoleBinding",
-   APIVersion: "",
+   APIVersion: "rbac.authorization.k8s.io/v1",
  },
  ObjectMeta: v1.ObjectMeta{
    ... // 2 identical fields
    Namespace: "openshift-machine-api",
    SelfLink: "",
-   UID: "cb8a7ffe-9966-4224-b1b6-3e7db6da7009",
+   UID: "",
-   ResourceVersion: "2571",
+   ResourceVersion: "",
    Generation: 0,
-   CreationTimestamp: v1.Time{Time: s"2023-09-19 03:02:31 +0000 UTC"},
+   CreationTimestamp: v1.Time{},
    DeletionTimestamp: nil,
    DeletionGracePeriodSeconds: nil,
    ... // 2 identical fields
    OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "fb1c6e8c-01bc-415f-8b55-c55a4601bd10", ...}},
    Finalizers: nil,
-   ManagedFields: []v1.ManagedFieldsEntry{
-     {
-       Manager: "cluster-version-operator",
-       Operation: "Update",
-       APIVersion: "rbac.authorization.k8s.io/v1",
-       Time: s"2023-09-19 03:02:31 +0000 UTC",
-       FieldsType: "FieldsV1",
-       FieldsV1: s`{"f:metadata":{"f:annotations":{".":{},"f:capability.openshift.i`...,
-     },
-   },
+   ManagedFields: nil,
  },
  Subjects: {{Kind: "ServiceAccount", Name: "cluster-baremetal-operator", Namespace: "openshift-machine-api"}},
  RoleRef: {APIGroup: "rbac.authorization.k8s.io", Kind: "ClusterRole", Name: "cluster-baremetal-operator"},
}
...
I0919 10:14:55.572553 1 core.go:138] Updating ConfigMap openshift-machine-config-operator/kube-rbac-proxy due to diff: &v1.ConfigMap{
  TypeMeta: v1.TypeMeta{
-   Kind: "",
+   Kind: "ConfigMap",
-   APIVersion: "",
+   APIVersion: "v1",
  },
  ObjectMeta: v1.ObjectMeta{
    ... // 2 identical fields
    Namespace: "openshift-machine-config-operator",
    SelfLink: "",
-   UID: "9c6c667f-8e10-4fca-8c1d-c8c0fc158ee5",
+   UID: "",
-   ResourceVersion: "164024",
+   ResourceVersion: "",
    Generation: 0,
-   CreationTimestamp: v1.Time{Time: s"2023-09-19 03:01:42 +0000 UTC"},
+   CreationTimestamp: v1.Time{},
    DeletionTimestamp: nil,
    DeletionGracePeriodSeconds: nil,
    ... // 2 identical fields
    OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "fb1c6e8c-01bc-415f-8b55-c55a4601bd10", ...}},
    Finalizers: nil,
-   ManagedFields: []v1.ManagedFieldsEntry{
-     {
-       Manager: "cluster-version-operator",
-       Operation: "Update",
-       APIVersion: "v1",
-       Time: s"2023-09-19 10:10:23 +0000 UTC",
-       FieldsType: "FieldsV1",
-       FieldsV1: s`{"f:data":{},"f:metadata":{"f:annotations":{".":{},"f:include.re`...,
-     },
-     {
-       Manager: "machine-config-operator",
-       Operation: "Update",
-       APIVersion: "v1",
-       Time: s"2023-09-19 10:10:25 +0000 UTC",
-       FieldsType: "FieldsV1",
-       FieldsV1: s`{"f:data":{"f:config-file.yaml":{}}}`,
-     },
-   },
+   ManagedFields: nil,
  },
  Immutable: nil,
  Data: {"config-file.yaml": "authorization:\n resourceAttributes:\n apiVersion: v1\n reso"...},
  BinaryData: nil,
}
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-15-233408
How reproducible:
1/1
Steps to Reproduce:
1. Install a 4.14 cluster 2. 3.
Actual results:
CVO hotloops on ClusterRoleBinding cluster-baremetal-operator and ConfigMap openshift-machine-config-operator/kube-rbac-proxy
Expected results:
CVO doesn't hotloop on resources with empty ManagedFields
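To illustrate why such diffs never converge, here is a minimal sketch of a comparison that strips the fields appearing with `-`/`+` markers in the diffs above before deciding whether an update is needed. It is an assumption about one possible approach, written against plain dictionaries; it is not the CVO's actual fix, and the field list and function names are hypothetical.

```python
import copy

# Assumption: fields populated by the API server rather than by the manifest.
SERVER_POPULATED_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "managedFields",
)

# Assumption: type information may be blank on one side of the comparison,
# as in the TypeMeta section of the diffs above, so it is ignored as well.
TYPE_FIELDS = ("kind", "apiVersion")


def normalize(obj):
    """Return a copy of obj without server-populated or type-only fields."""
    cleaned = copy.deepcopy(obj)
    for key in TYPE_FIELDS:
        cleaned.pop(key, None)
    metadata = cleaned.get("metadata", {})
    for key in SERVER_POPULATED_METADATA:
        metadata.pop(key, None)
    return cleaned


def needs_update(desired, live):
    """True only when the objects differ in fields the manifest controls."""
    return normalize(desired) != normalize(live)


if __name__ == "__main__":
    desired = {"kind": "ConfigMap", "apiVersion": "v1",
               "metadata": {"name": "kube-rbac-proxy"}, "data": {"a": "b"}}
    live = {"metadata": {"name": "kube-rbac-proxy", "uid": "9c6c667f",
                         "resourceVersion": "164024", "managedFields": []},
            "data": {"a": "b"}}
    print(needs_update(desired, live))
```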
Additional info:
Description of problem:
OCPCLOUD-2277 restricted access to the cma metrics. This led to a regression in hypershift e2e tests. The long-term plan is likely for hypershift to remove that dependency, but to get things working again we plan to revert the cma change until the dependency can be removed.
A PR removing the probes from hypershift is being worked on.
A table in a dashboard relies on the order of the metric labels to merge results
Create a dashboard with a table including this query:
label_replace(sort_desc(sum(sum_over_time(ALERTS{alertstate="firing"}[24h])) by ( alertstate, alertname)), "aaa", "$1", "alertstate", "(.+)")
A single row will be displayed as the query is simulating that the first label `aaa` has a single value.
Expected result:
The table should not rely on a single metric label to merge results but consider all the labels so the expected rows are displayed.
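The difference between the two merge strategies can be sketched with plain dictionaries. This is an illustrative model only, not the dashboard's code: the sample series, the function names, and the assumption that "first label" means the alphabetically first label name are all hypothetical.

```python
def merge_rows(series, key_fn):
    """Merge series into table rows; series sharing a key collapse into one row."""
    rows = {}
    for labels, value in series:
        rows[key_fn(labels)] = (labels, value)
    return list(rows.values())


def first_label_key(labels):
    """Key on a single label (assumed here to be the alphabetically first)."""
    name = sorted(labels)[0]
    return (name, labels[name])


def all_labels_key(labels):
    """Key on the full label set, so distinct series stay distinct."""
    return tuple(sorted(labels.items()))


if __name__ == "__main__":
    # Two alerts that share the synthetic label aaa="firing".
    series = [
        ({"aaa": "firing", "alertname": "AlertA", "alertstate": "firing"}, 12.0),
        ({"aaa": "firing", "alertname": "AlertB", "alertstate": "firing"}, 3.0),
    ]
    print(len(merge_rows(series, first_label_key)))
    print(len(merge_rows(series, all_labels_key)))
```

Keying on one label collapses both alerts into a single row, which matches the reported symptom; keying on every label keeps one row per series.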
This is a clone of issue OCPBUGS-26767. The following is the description of the original issue:
—
Description of problem:
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc get co/image-registry
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
image-registry False True True 50m Available: The deployment does not exist...
[inner hamzy@li-3d08e84c-2e1c-11b2-a85c-e2db7bb078fc hamzy-release]$ oc describe co/image-registry
...
Message: Progressing: Unable to apply resources: unable to sync storage configuration: cos region corresponding to a powervs region wdc not found
...
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-ppc64le-2024-01-10-083055
How reproducible:
Always
Steps to Reproduce:
1. Deploy a PowerVS cluster in wdc06 zone
Actual results:
See above error message
Expected results:
Cluster deploys
A long-lived cluster updating into 4.16.0-ec.1 was bitten by the Engineering Candidate's month-or-more-old api-int CA rotation (details on early rotation in API-1687). After manually updating /var/lib/kubelet/kubeconfig to include the new CA (which OCPBUGS-25821 is working on automating), multus pods still complained about untrusted api-int:
$ oc -n openshift-multus logs multus-pz7zp | grep api-int | tail -n5
E0119 19:33:52.983918 3194 reflector.go:148] k8s.io/client-go/informers/factory.go:150: Failed to watch *v1.Pod: failed to list *v1.Pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/pods?fieldSelector=spec.nodeName%3Dbuild0-gstfj-m-2.c.openshift-ci-build-farm.internal&resourceVersion=4723865081": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [error] Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:33:55Z [verbose] ADD finished CNI request ContainerID:"b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62" Netns:"/var/run/netns/36923fe0-e28d-422f-8213-233086527baa" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-machine-api;K8S_POD_NAME=cluster-autoscaler-default-f8dd547c7-dg9t5;K8S_POD_INFRA_CONTAINER_ID=b554f8edca8ea7672119c1aa71a69e0368fefeb5f8ae2c2659f822b7fa8d3f62;K8S_POD_UID=f79ff01a-71c2-4f02-b48b-8c23c9e875ce" Path:"", result: "", err: error configuring pod [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5] networking: Multus: [openshift-machine-api/cluster-autoscaler-default-f8dd547c7-dg9t5/f79ff01a-71c2-4f02-b48b-8c23c9e875ce]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-machine-api/pods/cluster-autoscaler-default-f8dd547c7-dg9t5?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [error] Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
2024-01-19T19:34:00Z [verbose] ADD finished CNI request ContainerID:"cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb" Netns:"/var/run/netns/bc7fbf17-c049-4241-a7dc-7e27acd3c8af" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=openshift-kube-storage-version-migrator;K8S_POD_NAME=migrator-558d4d48b9-ggjpj;K8S_POD_INFRA_CONTAINER_ID=cfd0b8ca596411f1e26ae058fc9f015d6edeac407668420c023ff459860423eb;K8S_POD_UID=769153af-350b-492b-9589-ede2574aea85" Path:"", result: "", err: error configuring pod [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj] networking: Multus: [openshift-kube-storage-version-migrator/migrator-558d4d48b9-ggjpj/769153af-350b-492b-9589-ede2574aea85]: error waiting for pod: Get "https://api-int.build02.gcp.ci.openshift.org:6443/api/v1/namespaces/openshift-kube-storage-version-migrator/pods/migrator-558d4d48b9-ggjpj?timeout=1m0s": tls: failed to verify certificate: x509: certificate signed by unknown authority
The multus pod needed a delete/replace, and after that it recovered:
$ oc --as system:admin -n openshift-multus delete pod multus-pz7zp
pod "multus-pz7zp" deleted
$ oc -n openshift-multus get -o wide pods | grep 'NAME\|build0-gstfj-m-2.c.openshift-ci-build-farm.internal'
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
multus-additional-cni-plugins-wrdtt 1/1 Running 1 28h 10.0.0.3 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none>
multus-admission-controller-74d794678b-9s7kl 2/2 Running 0 27h 10.129.0.36 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none>
multus-hxmkz 1/1 Running 0 11s 10.0.0.3 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none>
network-metrics-daemon-dczvs 2/2 Running 2 28h 10.129.0.4 build0-gstfj-m-2.c.openshift-ci-build-farm.internal <none> <none>
$ oc -n openshift-multus logs multus-hxmkz | grep -c api-int
0
That need for multus-pod deletion should be automated, to reduce the number of things that need manual touches when the api-int CA rolls.
Seen in 4.16.0-ec.1.
Several multus pods on this cluster were bitten. But others were not, including some on clusters with old kubeconfigs that did not contain the new CA. I'm not clear on what the trigger is; perhaps some clients escape immediate trouble by having existing api-int connections to servers from back when the servers used the old CA. But deleting the multus pod on a cluster whose /var/lib/kubelet/kubeconfig has not yet been updated will likely reproduce the breakage, at least until OCPBUGS-25821 is fixed.
Not entirely clear, but something like:
Multus still fails to trust api-int until the broken pod is deleted or the container otherwise restarts to notice the updated kubeconfig.
Multus pod automatically pulls in the updated kubeconfig.
One possible implementation would be a liveness probe failing on api-int trust issues, triggering the kubelet to roll the multus container, and the replacement multus container to come up and load the fresh kubeconfig.
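A probe along those lines could look roughly like the sketch below. It is only an illustration of the idea: the function names are hypothetical, the host, port, and CA file path would have to come from the node's kubeconfig, and nothing here reflects how multus actually implements its health checks. Only certificate-verification failures make the probe fail; timeouts and refused connections are treated as unrelated to CA trust.

```python
import socket
import ssl


def tls_handshake(host, port, ca_file, timeout=5.0):
    """Open a TLS connection, verifying the server against ca_file."""
    context = ssl.create_default_context(cafile=ca_file)
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host):
            pass  # a completed handshake means the CA is trusted


def liveness_exit_code(check_api):
    """Map the outcome of check_api() to a probe exit code.

    1 means the API server certificate is not trusted, so the container
    should be restarted to reload the kubeconfig. 0 means healthy, or a
    failure that a restart would not fix.
    """
    try:
        check_api()
    except ssl.SSLCertVerificationError:
        return 1
    except OSError:
        return 0
    return 0


if __name__ == "__main__":
    import sys

    # Hypothetical invocation: probe.py <host> <port> <ca-bundle-path>
    host, port, ca_file = sys.argv[1], int(sys.argv[2]), sys.argv[3]
    sys.exit(liveness_exit_code(lambda: tls_handshake(host, port, ca_file)))
```

The certificate-verification handler is listed first because `ssl.SSLCertVerificationError` is itself a subclass of `OSError`.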
Description of problem:
The kubelet is running with `unconfined_service_t`. It should run as `kubelet_exec_t`. This is causing all our plugins to fail because of SELinux denials.
sh-5.1# ps -AZ | grep kubelet
system_u:system_r:unconfined_service_t:s0 8719 ? 00:24:50 kubelet
This issue was previously observed and resolved in 4.14.10.
Version-Release number of selected component (if applicable):
OCP 4.15
How reproducible:
Run ps -AZ | grep kubelet to see kubelet running with wrong label
Steps to Reproduce:
1. 2. 3.
Actual results:
Kubelet is running as unconfined_service_t
Expected results:
Kubelet should run as kubelet_exec_t
Additional info:
Description of problem:
The vSphere code references a Red Hat solution that has been retired in favour of the code being merged into the official documentation. https://github.com/openshift/machine-api-operator/blob/master/pkg/controller/vsphere/reconciler.go#L827
Version-Release number of selected component (if applicable):
4.11-4.13 + main
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
The UI presents a message with a solution that customers cannot access:
Hardware lower than 15 is not supported, clone stopped. Detected machine template version is 13. Please update machine template: https://access.redhat.com/articles/6090681
Expected results:
The message should reference the official documentation: https://docs.openshift.com/container-platform/4.12/updating/updating-hardware-on-nodes-running-on-vsphere.html
Additional info:
Please review the following PR: https://github.com/openshift/kube-rbac-proxy/pull/72
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-27760. The following is the description of the original issue:
—
I noticed this today when looking at component readiness. A ~5% decrease in pass rate may seem minor, but these can certainly add up. This test passed 713 times in a row on 4.14. You can see today's failure here.
Details below:
-------
Component Readiness has found a potential regression in [sig-cluster-lifecycle] pathological event should not see excessive Back-off restarting failed containers.
Probability of significant regression: 99.96%
Sample (being evaluated) Release: 4.15
Start Time: 2024-01-17T00:00:00Z
End Time: 2024-01-23T23:59:59Z
Success Rate: 94.83%
Successes: 55
Failures: 3
Flakes: 0
Base (historical) Release: 4.14
Start Time: 2023-10-04T00:00:00Z
End Time: 2023-10-31T23:59:59Z
Success Rate: 100.00%
Successes: 713
Failures: 0
Flakes: 4
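The figures above can be sanity-checked with a short calculation. This is a sketch only: it assumes a one-sided Fisher's exact test on the success/failure counts, which is an assumption about how the probability is derived (the report does not state the method), and it ignores flakes. The function names are hypothetical.

```python
from math import comb


def pass_rate(successes, failures):
    """Pass rate in percent."""
    return 100.0 * successes / (successes + failures)


def fisher_one_sided(sample_fail, sample_total, base_fail, base_total):
    """P(sample contains at least sample_fail failures) if both releases
    shared one failure rate (hypergeometric tail)."""
    total = sample_total + base_total
    fails = sample_fail + base_fail
    numerator = 0
    for k in range(sample_fail, min(fails, sample_total) + 1):
        numerator += comb(fails, k) * comb(total - fails, sample_total - k)
    return numerator / comb(total, sample_total)


if __name__ == "__main__":
    # Counts taken from the report: 4.15 sample vs. 4.14 base.
    p = fisher_one_sided(sample_fail=3, sample_total=58, base_fail=0, base_total=713)
    print(f"sample pass rate: {pass_rate(55, 3):.2f}%")
    print(f"probability of regression (1 - p): {100 * (1 - p):.2f}%")
```

Under these assumptions the sample pass rate works out to 94.83%, and the regression probability lands close to the 99.96% quoted in the report.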
Description of problem:
If we replace the cluster global pull secret with an empty one, the MCO keeps the original secret file at the `/etc/machine-config-daemon/orig/var/lib/kubelet/config.json.mcdorig` location.
Version-Release number of selected component (if applicable):
4.12.z
Steps to Reproduce:
1. Create an SNO cluster using cluster-bot: launch 4.12.9 aws,single-node
2. Replace the pull secret
```
$ cat <<EOF | oc replace -f -
apiVersion: v1
data:
  .dockerconfigjson: e30K
kind: Secret
metadata:
  name: pull-secret
  namespace: openshift-config
type: kubernetes.io/dockerconfigjson
EOF
```
3. Wait for the cluster to reconcile
```
$ oc get mc
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
00-worker                                          f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
01-master-container-runtime                        f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
01-master-kubelet                                  f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
01-worker-container-runtime                        f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
01-worker-kubelet                                  f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
99-master-generated-kubelet                        f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
99-master-generated-registries                     f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
99-master-ssh                                                                                 3.2.0             60m
99-worker-generated-registries                     f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
99-worker-ssh                                                                                 3.2.0             60m
rendered-master-50d505c46c5e1dae8f1d91c81b2e0d1e   f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
rendered-master-619b2780e8787c88c3acb0c68de45a9f   f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             36m
rendered-master-801d3c549c0fb3267cafc7e48968a8ac   f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
rendered-worker-86690adc0446e7f7feb68f9b9690632d   f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             36m
rendered-worker-d7e635328a14333ed6ad27603fe5b5db   f6c21976e39cf6cb9e2ca71141478d5e612fb53f   3.2.0             56m
```
4. Debug into the node and check the file
```
$ cat /etc/machine-config-daemon/orig/var/lib/kubelet/config.json.mcdorig
```
Actual results:
The orig file still contains the actual pull secrets that were used during initial cluster provisioning.
Expected results:
There shouldn't be any file with this info
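For reference, the `.dockerconfigjson` value `e30K` used in the reproduction steps is the base64 encoding of an empty JSON object. A minimal sketch that decodes it and checks for credentials follows; the helper names are hypothetical, and the check assumes the usual `auths` key of a docker config file.

```python
import base64
import json


def decode_dockerconfig(b64_value):
    """Decode a base64 .dockerconfigjson value into a Python dict."""
    return json.loads(base64.b64decode(b64_value))


def has_credentials(config):
    """Return True if the decoded config lists at least one registry auth."""
    return bool(config.get("auths"))


if __name__ == "__main__":
    config = decode_dockerconfig("e30K")
    print(config, has_credentials(config))
```

Running the same check against the contents of the leftover `.mcdorig` file would show whether registry credentials are still present on the node.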
Additional info:
Description of problem:
Agent based installation is stuck on the booting screen for the arm64 SNO cluster.
The installer should validate the architecture set by the user in the install-config.yaml against the payload image being used.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
100%
Steps to Reproduce:
[Fixed original version]
1. Create agent ISO with the amd64 payload
2. Boot the created ISO on arm64 server
3. Monitor the booting screen for error
[Generalized]
1. Set the install-config.yaml controlPlane.architecture to arm64
2. Try to install with an
Actual results:
The installation is currently stuck on the initial booting screen.
Expected results:
The SNO cluster should be installed without any issues.
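The validation requested in the description could look roughly like the sketch below. It is illustrative only and is not the installer's code: the function names and the alias table are assumptions, and how the payload architecture is obtained is left out.

```python
# Hypothetical alias table mapping common kernel names to the names used in
# install-config.yaml.
ARCH_ALIASES = {"x86_64": "amd64", "aarch64": "arm64"}


def normalize_arch(arch):
    """Map an architecture string to its normalized lowercase name."""
    arch = arch.strip().lower()
    return ARCH_ALIASES.get(arch, arch)


def validate_architecture(install_config_arch, payload_arch):
    """Raise ValueError if the configured and payload architectures differ."""
    wanted = normalize_arch(install_config_arch)
    actual = normalize_arch(payload_arch)
    if wanted != actual:
        raise ValueError(
            f"install-config architecture {wanted!r} does not match "
            f"release payload architecture {actual!r}"
        )


if __name__ == "__main__":
    try:
        validate_architecture("arm64", "amd64")
    except ValueError as err:
        print(f"validation failed: {err}")
```

Failing early with a message like this would replace the silent hang on the boot screen.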
Additional info:
Compact cluster installation was successful, here is the prow ci link: https://gcsweb-qe-private-deck-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/qe-private-deck/logs/periodic-ci-openshift-openshift-tests-private-release-4.13-arm64-nightly-baremetal-compact-agent-ipv4-static-connected-p1-f7/1665833590451081216/artifacts/baremetal-compact-agent-ipv4-static-connected-p1-f7/baremetal-lab-agent-install/build-log.txt
Please review the following PR: https://github.com/openshift/cluster-dns-operator/pull/380
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the prerelease doc "Configure a secondary external gateway", step 3 states that the output of the following command should confirm that the admin policy has been created:
#oc describe apbexternalroute <name> | tail -n 6
First of all, this is a typo: there is no "apbexternalroute"; the correct term is "adminpolicybasedexternalroutes". Even with the correct term, the resulting output says almost nothing about the status of the policy. It only reports on the policy itself and some minor details such as timestamps.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-04-143709
How reproducible:
Every time
Steps to Reproduce:
1. Deploy a cluster
2. Boot up a pod under a namespace
3. Create the following policy file, then apply it:
$ cat 4.create.abp_static_bar1.yaml
apiVersion: k8s.ovn.org/v1
kind: AdminPolicyBasedExternalRoute
metadata:
  name: first-policy
spec:
## gateway example
  from:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: bar
  nextHops:
    static:
      - ip: "173.20.0.8"
      - ip: "173.20.0.9"
4. Confirm the policy is in place:
$ oc get adminpolicybasedexternalroutes.k8s.ovn.org
NAME           LAST UPDATE   STATUS
first-policy
5. But how do we test the policy's status? The doc's guide doesn't help much:
$ oc describe adminpolicybasedexternalroutes.k8s.ovn.org <name> | tail -n 6
$ oc describe adminpolicybasedexternalroutes.k8s.ovn.org first-policy
Name:         first-policy
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  k8s.ovn.org/v1
Kind:         AdminPolicyBasedExternalRoute
Metadata:
  Creation Timestamp:  2023-10-30T20:09:20Z
  Generation:          1
  Resource Version:    10904672
  UID:                 3c4a60da-a618-45b1-94a8-2085dcdc5631
Spec:
  From:
    Namespace Selector:
      Match Labels:
        kubernetes.io/metadata.name:  bar
  Next Hops:
    Static:
      Bfd Enabled:  false
      Ip:           173.20.0.8
      Bfd Enabled:  false
      Ip:           173.20.0.9
Events:  <none>
Nothing regarding policy status shows up. If status is supported at all, then besides fixing the doc, the way to view the status should be documented. One more thing: if there is indeed a policy status, shouldn't it also populate the STATUS column here:
$ oc get adminpolicybasedexternalroutes.k8s.ovn.org
NAME           LAST UPDATE   STATUS
first-policy
Asking because on another bug, https://issues.redhat.com/browse/OCPBUGS-22706, I recreated a situation where the status should have reported an error, yet it never did, nor did it update the above table. Come to think of it, the LAST UPDATE column has never exposed any data either, in which case why do we even have these two columns to begin with?
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/cloud-provider-powervs/pull/43
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
As part of this slack thread.
Description of problem:
When SRE collects data using `oc adm inspect`, the collection reports an error on 'secrets' (see below). This is because of the way SRE manages our hosted platforms: the SRE users (service accounts) are not 'true admins' and must impersonate admins to perform operations.
$ oc adm inspect --dest-dir=must-gather ns/openshift-sdn
Gathering data for ns/openshift-sdn...
...
Wrote inspect data to must-gather.
error: errors occurred while gathering data:
    secrets is forbidden: User "system:serviceaccount:openshift-backplane-srep:f2b5cf795ef1fc5289490411d49ab042" cannot list resource "secrets" in API group "" in the namespace "openshift-sdn"
At the end of the day, the 'error' here is not a true error but rather a warning, telling the user that a specific object wasn't collected.
Description of problem:
[4.14][AWS EFS][HCP] should not support ARN mode installation in web console
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-20-033502
How reproducible:
Always
Steps to Reproduce:
1. Install a hypershift cluster with the details mentioned below. Flexy template: aos-4_15/ipi-on-aws/versioned-installer-ovn-hypershift-ci
2. Install the AWS EFS operator from OperatorHub, which asks for an ARN value to add.
Actual results:
It asks for ARN value in web console
Expected results:
It should not ask for an ARN value, as ARN mode is currently not supported.
Additional info:
Epic: https://issues.redhat.com/browse/STOR-1347
Discussion:
1. https://redhat-internal.slack.com/archives/C01C8502FMM/p1695208537708359
2. https://redhat-internal.slack.com/archives/GK0DA0JR5/p1695305164755109
3. https://redhat-internal.slack.com/archives/CS05TR7BK/p1695357879885239
Attaching a screenshot for the same: https://drive.google.com/file/d/11wjzz8-1kFDMKQ4Y2MWdJjjnfLaLmWD5/view?usp=sharing
Bump Golang to v1.20 in Containerfile.operator for RHTAP
This is a clone of issue OCPBUGS-27159. The following is the description of the original issue:
—
This is a continuation of OCPBUGS-23342. Now the vmware-vsphere-csi-driver-operator cannot connect to vCenter at all. Tested using invalid credentials.
The operator ends up with no Progressing condition during upgrade from 4.11 to 4.12, and cluster-storage-operator interprets it as Progressing=true.
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29384.
Please review the following PR: https://github.com/openshift/ironic-agent-image/pull/97
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26933. The following is the description of the original issue:
—
Description of problem:
Console is overriding status code of HTTP requests proxied to dynamic plugin services
Version-Release number of selected component (if applicable):
4.15.0-ec.2
How reproducible:
Always
Steps to Reproduce:
1. Create an OpenShift 4.15.0-ec.2 cluster or newer
2. Install ACM 2.9.1 from OperatorHub and create a MultiClusterHub operand
3. Expose the plugin service: oc -n multicluster-engine expose service console-mce-console
4. Set tls.termination to passthrough on route/console-mce-console
5. Compare responses from curling the proxy and the service directly.
Actual results:
$ curl -k -D - -H "Cookie: <REDACTED>" -H "If-Modified-Since: Thu, 07 Dec 2023 14:45:30 GMT" https://console-openshift-console.apps.kevin-415.dev02.red-chesterfield.com/api/plugins/mce/plugin-manifest.json
HTTP/1.1 200 OK
cache-control: no-cache
date: Wed, 10 Jan 2024 21:16:10 GMT
last-modified: Thu, 07 Dec 2023 14:45:30 GMT
referrer-policy: strict-origin-when-cross-origin
x-content-type-options: nosniff
x-dns-prefetch-control: off
x-frame-options: DENY
x-xss-protection: 1; mode=block
content-length: 0

$ curl -k -D - -H "If-Modified-Since: Thu, 07 Dec 2023 14:45:30 GMT" https://console-mce-console-multicluster-engine.apps.kevin-415.dev02.red-chesterfield.com/plugin/plugin-manifest.json
HTTP/2 304
cache-control: no-cache
last-modified: Thu, 07 Dec 2023 14:45:30 GMT
date: Wed, 10 Jan 2024 21:26:33 GMT
Expected results:
Response code of 304 should be returned by the proxy route, not changed to 200.
Additional info:
Introduced by https://github.com/openshift/console/pull/13272
Description of problem:
There is no clear error log when creating an STS cluster with a KMS key whose policy does not include the installer role.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Prepare KMS with the aws command
aws kms create-key --tags TagKey=Purpose,TagValue=Test --description "kms Key"
2. Create an STS cluster with the KMS key
rosa create cluster --cluster-name ying-k1 --sts --role-arn arn:aws:iam::301721915996:role/ying16-Installer-Role --support-role-arn arn:aws:iam::301721915996:role/ying16-Support-Role --controlplane-iam-role arn:aws:iam::301721915996:role/ying16-ControlPlane-Role --worker-iam-role arn:aws:iam::301721915996:role/ying16-Worker-Role --operator-roles-prefix ying-k1-e2g3 --oidc-config-id 23ggvdh2jouranue87r5ujskp8hctisn --region us-west-2 --version 4.12.15 --replicas 2 --compute-machine-type m5.xlarge --machine-cidr 10.0.0.0/16 --service-cidr 172.30.0.0/16 --pod-cidr 10.128.0.0/14 --host-prefix 23 --kms-key-arn arn:aws:kms:us-west-2:301721915996:key/c60b5a31-1a5c-4d73-93ee-67586d0eb90d
Actual results:
The installation fails. Here is the install log: http://pastebin.test.redhat.com/1100008
Expected results:
There should be a detailed error message for the KMS that has no installer role
Additional info:
The installation succeeds if the installer role ARN is added to the KMS key policy:
{
  "Version": "2012-10-17",
  "Id": "key-default-1",
  "Statement": [
    {
      "Sid": "Enable IAM User Permissions",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::301721915996:role/ying16-Installer-Role",
          "arn:aws:iam::301721915996:root"
        ]
      },
      "Action": "kms:*",
      "Resource": "*"
    }
  ]
}
Description of problem:
Pipelinerun task log switcher is stuck and is not loading the respective task logs when you switch from one task to another.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
Always
Steps to Reproduce:
1. Create a pipeline with multiple tasks.
2. Start the pipeline and go to the logs page.
3. Switch between the tasks to see their logs.
Actual results:
Not able to click the task on the left-hand side, and the logs window shows a blank screen.
Expected results:
Should be able to switch between the tasks and selected task logs should be shown in the log window
Attached Video:
https://drive.google.com/file/d/1pPQm9YYyWZxfCwFnudviSCyqoPHn8D9x/view?usp=sharing
The "Overwriting current silence" information alert should have padding to be consistent with other alert messages.
Description of problem:
Getting rate limit issue and other failures while running "test serverless function" tests
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/prometheus-operator/pull/243
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-29305. The following is the description of the original issue:
—
Description of problem:
There's a typo in the openssl commands within the ovn-ipsec-containerized/ovn-ipsec-host daemonsets. The correct parameter is "-checkend", not "-checkedn".
Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.14.10   True        False         7s      Cluster version is 4.14.10
How reproducible:
Steps to Reproduce:
1. Enable IPsec encryption
# oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec": {"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
Actual results:
Examining the initContainer (ovn-keys) logs
# oc logs ovn-ipsec-containerized-7bcd2 -c ovn-keys
...
+ openssl x509 -noout -dates -checkedn 15770000 -in /etc/openvswitch/keys/ipsec-cert.pem
x509: Use -help for summary.
# oc get ds
NAME                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                 AGE
ovn-ipsec-containerized   1         1         0       1            0           beta.kubernetes.io/os=linux   159m
ovn-ipsec-host            1         1         1       1            1           beta.kubernetes.io/os=linux   159m
ovnkube-node              1         1         1       1            1           beta.kubernetes.io/os=linux   3h44m
# oc get ds ovn-ipsec-containerized -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
# oc get ds ovn-ipsec-host -o yaml | grep edn
if ! openssl x509 -noout -dates -checkedn 15770000 -in $cert_pem; then
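For context, `openssl x509 -checkend N` reports whether a certificate expires within the next N seconds. The sketch below mirrors that comparison in plain Python so the intent of the daemonset check is easier to see; it is not the daemonset's code, the function name is hypothetical, and the certificate's notAfter time is passed in rather than parsed from a PEM file.

```python
from datetime import datetime, timedelta, timezone


def expires_within(not_after, seconds, now=None):
    """Return True if a certificate whose validity ends at not_after
    expires within the next `seconds` seconds."""
    now = now or datetime.now(timezone.utc)
    return not_after <= now + timedelta(seconds=seconds)


if __name__ == "__main__":
    window = 15770000  # value used in the daemonset command, roughly half a year
    print(f"window in days: {window / 86400:.1f}")
    not_after = datetime.now(timezone.utc) + timedelta(days=30)
    print("needs renewal:", expires_within(not_after, window))
```

With the misspelled flag, openssl prints its usage hint instead of performing this comparison, which matches the init container log shown above.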
Description of problem:
Hosted control plane clusters of OCP 4.15 are using default catalog sources (redhat-operators, certified-operators, community-operators and redhat-marketplace) pointing to the 4.14, thus 4.15 operators are not available and this can't be updated from within the guest.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100%
Steps to Reproduce:
1. check the .spec.image of the default catalog sources in openshift-marketplace namespace
Actual results:
the default catalogs are pointing to :v4.14
Expected results:
they should point to :v4.15 instead
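The check in the reproduction step can be scripted. This is a minimal sketch, assuming the catalog images are tagged `v<major>.<minor>` as described above; the helper names and the image references in the example are illustrative, not taken from a real cluster.

```python
def expected_catalog_tag(cluster_version):
    """Derive the expected catalog tag (e.g. 'v4.15') from a cluster version."""
    major, minor = cluster_version.split(".")[:2]
    return f"v{major}.{minor}"


def stale_catalogs(images, cluster_version):
    """Return the catalog sources whose image tag does not match the cluster."""
    tag = expected_catalog_tag(cluster_version)
    return {
        name: image
        for name, image in images.items()
        if image.rsplit(":", 1)[-1] != tag
    }


if __name__ == "__main__":
    # Illustrative values for .spec.image of two default catalog sources.
    images = {
        "redhat-operators": "registry.example.com/redhat-operator-index:v4.14",
        "community-operators": "registry.example.com/community-operator-index:v4.15",
    }
    print(stale_catalogs(images, "4.15.0"))
```

On an affected hosted cluster, all four default catalog sources would be reported as stale.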
Additional info:
Description of problem:
The image registry operator in Azure by default has two replicas. Every 5 minutes, each of those replicas makes a call to the StorageAccount List operation for the image registry storage account. Azure has published limits for storage account throttling operations. These limits are 100 calls to list operations every 5 minutes based on the subscription & region pair that exists. Because of this, customers are limited to <50 clusters per subscription and region in Azure. This number can change based on the number of image registry replicas as well as customer activity on List storage account operations within that subscription and region. On Azure Red Hat OpenShift managed service, we occasionally have customers exceeding these limits including internal customers for demos, preventing them from creating new clusters within the subscription & region due to these scaling limits.
Version-Release number of selected component (if applicable):
N/A
How reproducible:
Always.
Steps to Reproduce:
1. Scale up the number of image registry pods to hit the 100 / 5 minute List limit (50 replicas, or enough clusters within a given subscription & region)
2. Attempt to create a new cluster
3. Cluster installation may fail due to the image-registry cluster operator never going healthy, or the installer not being able to generate a storage account key for the bootstrap node to fetch its ignition config.
Actual results:
storage.AccountsClient#ListAccountSAS: Failure responding to request: StatusCode=429 -- Original Error: autorest/azure: Service returned an error. Status=429 Code="TooManyRequests" Message="The request is being throttled as the limit has been reached for operation type - Read_ObservationWindow_00:05:00. For more information, see - https://aka.ms/srpthrottlinglimits"
Expected results:
Cluster installs successfully
Additional info:
Raising this as a bug since this issue will be persistent across all cluster installations should one exceed the threshold. It will also impact the image-registry pod health.
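The scaling limit described in this report follows from simple arithmetic. The sketch below reproduces it using only the numbers stated above (100 List calls per 5 minutes, two replicas per cluster, one call per replica per window); the function names are hypothetical, and other List traffic in the subscription is ignored.

```python
def list_calls_per_window(clusters, replicas_per_cluster=2, calls_per_replica=1):
    """List calls issued per 5-minute window by the image registry operators."""
    return clusters * replicas_per_cluster * calls_per_replica


def max_clusters(limit=100, replicas_per_cluster=2, calls_per_replica=1):
    """Number of clusters at which the throttling limit is fully consumed."""
    return limit // (replicas_per_cluster * calls_per_replica)


if __name__ == "__main__":
    print("clusters that exhaust the limit:", max_clusters())
    print("calls issued by 50 clusters:", list_calls_per_window(50))
```

Any additional List activity by the customer lowers the practical ceiling further, which is why the report speaks of fewer than 50 clusters.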
Please review the following PR: https://github.com/openshift/multus-cni/pull/183
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/machine-api-provider-powervs/pull/66
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description:
Now that the large number of e2e test case failures in CI jobs is resolved, recent jobs show an "Undiagnosed panic detected in pod" issue.
Error:
{ pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686400 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
pods/openshift-image-registry_cluster-image-registry-operator-7f7bd7c9b4-k8fmh_cluster-image-registry-operator_previous.log.gz:E0825 02:44:06.686630 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)}
Some Observations:
1) While starting ImageConfigController, it failed to watch *v1.Route, as "the server could not find the requested resource",
2) which eventually led to a sync problem: "E0825 01:26:52.428694 1 clusteroperator.go:104] unable to sync ClusterOperatorStatusController: config.imageregistry.operator.openshift.io "cluster" not found, requeuing"
3) and then, while creating the deployment resource for "cluster-image-registry-operator", it caused a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
dnsmasq isn't starting on okd-scos in the bootstrap VM
logs show it failing with "Operation not permitted"
This is a clone of issue OCPBUGS-25696. The following is the description of the original issue:
—
Description of problem:
When deploying an HCP KubeVirt cluster using the hcp's --node-selector cli arg, that node selector is not applied to the "kubevirt-cloud-controller-manager" pods within the HCP namespace. This makes it impossible to pin all of the HCP pods to specific nodes.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. deploy an hcp kubevirt cluster with the --node-selector cli option 2. 3.
Actual results:
the node selector is not applied to cloud provider kubevirt pod
Expected results:
the node selector should be applied to cloud provider kubevirt pod.
Additional info:
Description of problem:
OCP 4.11 console UI is not consistent in showing what namespaces are managed. Below are the results; I have also attached the respective images.
1. Viewing installed operators for cp4i namespace shows the multi-namespace operators as managing All namespaces (but really these operators are restricted to 2 namespaces) ------>> Image multins-cp4i.png
2. Viewing installed operators for ibm-common-services namespace shows the multi-namespace operators as managing 2 namespaces ------>> Image multins-ibm-cs.png
3. Viewing installed operators for All Projects shows the multi-namespace operators as managing 2 namespaces ---->> Image multins-all.p
Slack Thread: https://coreos.slack.com/archives/C6A3NV5J9/p1668535310411939
How reproducible:
1. Install an operator into the "cp4i" namespace (operator group is OwnNamespace with just "cp4i")
2. Install operator(s) into the "ibm-common-services" namespace (operator group is OwnNamespace with just "ibm-common-services")
3. Edit the OperatorGroup in the "ibm-common-services" namespace and add the "cp4i" namespace. Now the operators in "ibm-common-services" are included in both the "ibm-common-services" and "cp4i" namespaces
4. Review the installed operators in the OCP 4.11 console for "cp4i", "ibm-common-services", and "All Projects"
Actual results:
Installed operators in cp4i project incorrectly shows Managed Namespaces as "All Namespaces". More can be seen in image----> multins-cp4i.png
Expected results:
Installed operators in cp4i project correctly shows Managed Namespaces
Additional info:
Slack Thread: https://coreos.slack.com/archives/C6A3NV5J9/p1668535310411939
Description of the problem:
Whenever creating an AgentClusterInstall without an imageSetRef, the assisted-service container crashes due to attempting to access a nil pointer
How reproducible:
100%
Steps to reproduce:
1. Create an AgentClusterInstall without an imageSetRef field
Actual results:
assisted-service container crashes
Expected results:
AgentClusterInstall updates with specsynced error or sufficient defaults.
Additional Information:
Seems to be due to the fact that there is no check if spec.ImageSetRef is nil in this function: https://github.com/openshift/assisted-service/blob/91fcb5bc822de96602657efd883ed419bbb64963/internal/controller/controllers/clusterdeployments_controller.go#L1439C3-L1439C3
The final iteration (of 3) of the fix for OCPBUGS-4248 - https://github.com/openshift/cluster-baremetal-operator/pull/341 - uses the (IPv6) API VIP as the IP address for IPv6 BMCs to contact Apache to download the image to mount via virtualmedia.
When the provisioning network is active, this should use the (IPv6) Provisioning VIP unless the virtualMediaViaExternalNetwork flag is true.
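The selection rule described above can be written as a small decision function. This is a sketch of the stated behavior only, not the cluster-baremetal-operator implementation; the function and parameter names are hypothetical, and the addresses in the example are placeholders.

```python
def virtual_media_host(api_vip, provisioning_vip,
                       provisioning_network_active,
                       virtual_media_via_external_network):
    """Pick the address an IPv6 BMC should contact to download the image."""
    # Prefer the Provisioning VIP when the provisioning network is active,
    # unless virtual media is explicitly routed via the external network.
    if provisioning_network_active and not virtual_media_via_external_network:
        return provisioning_vip
    return api_vip


if __name__ == "__main__":
    api_vip, prov_vip = "fd2e:6f44:5dd8::5", "fd00:1101::3"  # placeholder VIPs
    for active in (True, False):
        for external in (True, False):
            host = virtual_media_host(api_vip, prov_vip, active, external)
            print(f"provisioning active={active}, via external={external}: {host}")
```

Only the combination of an active provisioning network and a false virtualMediaViaExternalNetwork flag selects the Provisioning VIP.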
Description of problem:
Install 4.14 UPI cluster on azure stack hub, console could not be accessed outside cluster.
$ curl -L -k https://console-openshift-console.apps.jimawwt.installer.redhat.wwtatc.com -vv
* Trying 10.255.96.76:443...
* connect to 10.255.96.76 port 443 failed: Connection timed out
* Failed to connect to console-openshift-console.apps.jimawwt.installer.redhat.wwtatc.com port 443: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to console-openshift-console.apps.jimawwt.installer.redhat.wwtatc.com port 443: Connection timed out
Worker nodes are missing in public lb backend pool
$ az network lb address-pool list --lb-name jimawwt-jhvtn -g jimawwt-jhvtn-rg
[
  {
    "backendIPConfigurations": [
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/networkInterfaces/jimawwt-jhvtn-master-1-nic/ipConfigurations/pipConfig",
        "resourceGroup": "jimawwt-jhvtn-rg"
      },
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/networkInterfaces/jimawwt-jhvtn-master-0-nic/ipConfigurations/pipConfig",
        "resourceGroup": "jimawwt-jhvtn-rg"
      },
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/networkInterfaces/jimawwt-jhvtn-master-2-nic/ipConfigurations/pipConfig",
        "resourceGroup": "jimawwt-jhvtn-rg"
      }
    ],
    "etag": "W/\"7a9d24a2-ff06-4108-9aac-a277595792e3\"",
    "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/loadBalancers/jimawwt-jhvtn/backendAddressPools/jimawwt-jhvtn",
    "loadBalancingRules": [
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/loadBalancers/jimawwt-jhvtn/loadBalancingRules/api-public",
        "resourceGroup": "jimawwt-jhvtn-rg"
      },
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/loadBalancers/jimawwt-jhvtn/loadBalancingRules/a1a1c7bfe78c14a41a9149d42d698824-TCP-80",
        "resourceGroup": "jimawwt-jhvtn-rg"
      },
      {
        "id": "/subscriptions/de7e09c3-b59a-4c7d-9c77-439c11b92879/resourceGroups/jimawwt-jhvtn-rg/providers/Microsoft.Network/loadBalancers/jimawwt-jhvtn/loadBalancingRules/a1a1c7bfe78c14a41a9149d42d698824-TCP-443",
        "resourceGroup": "jimawwt-jhvtn-rg"
      }
    ],
    "name": "jimawwt-jhvtn",
    "provisioningState": "Succeeded",
    "resourceGroup": "jimawwt-jhvtn-rg"
  }
]
Similar bug OCPBUGS-14762 detected on Azure UPI. On installer side, we checked that public lb name and backendpool name for UPI are the same as ASH IPI.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-06-234925
How reproducible:
Always when installing Azure Stack UPI on 4.14
Steps to Reproduce:
1. Install UPI on Azure Stack Hub on 4.14 2. 3.
Actual results:
Worker nodes are missing in public lb backendpool
Expected results:
worker nodes are added into public lb backendpool and application can be accessed outside cluster
Additional info:
Issue is only detected on 4.14 azure stack hub UPI. It works on ASH IPI and 4.13/4.12 ASH UPI.
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/176
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
since in-cluster prometheus-operator and UWM prometheus-operator pods are scheduled to master nodes, see from
After enabling UWM and adding topologySpreadConstraints for both the in-cluster prometheus-operator and the UWM prometheus-operator (with topologyKey set to node-role.kubernetes.io/master), topologySpreadConstraints takes effect for the in-cluster prometheus-operator, but not for the UWM prometheus-operator
apiVersion: v1
data:
  config.yaml: |
    enableUserWorkload: true
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
in-cluster prometheus-operator, topologySpreadConstraints settings are loaded to prometheus-operator pod and deployment, see
$ oc -n openshift-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
        maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
      volumes:
$ oc -n openshift-monitoring get pod -l app.kubernetes.io/name=prometheus-operator -o wide
NAME                                   READY   STATUS    RESTARTS   AGE    IP            NODE                                                 NOMINATED NODE   READINESS GATES
prometheus-operator-65496d5b78-fb9nq   2/2     Running   0          105s   10.128.0.71   juzhao-0813-szb9h-master-0.c.openshift-qe.internal   <none>           <none>
$ oc -n openshift-monitoring get pod prometheus-operator-65496d5b78-fb9nq -oyaml | grep topologySpreadConstraints -A7
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        app.kubernetes.io/name: prometheus-operator
    maxSkew: 1
    topologyKey: node-role.kubernetes.io/master
    whenUnsatisfiable: DoNotSchedule
  volumes:
but the topologySpreadConstraints settings are not loaded to UWM prometheus-operator pod and deployment
$ oc -n openshift-user-workload-monitoring get cm user-workload-monitoring-config -oyaml
apiVersion: v1
data:
  config.yaml: |
    prometheusOperator:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: node-role.kubernetes.io/master
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: prometheus-operator
kind: ConfigMap
metadata:
  creationTimestamp: "2023-08-14T08:10:49Z"
  labels:
    app.kubernetes.io/managed-by: cluster-monitoring-operator
    app.kubernetes.io/part-of: openshift-monitoring
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
  resourceVersion: "212490"
  uid: 048f91cb-4da6-4b1b-9e1f-c769096ab88c
$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -oyaml | grep topologySpreadConstraints -A7
no result
$ oc -n openshift-user-workload-monitoring get pod -l app.kubernetes.io/name=prometheus-operator
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-77bcdcbd9c-m5x8z   2/2     Running   0          15m
$ oc -n openshift-user-workload-monitoring get pod prometheus-operator-77bcdcbd9c-m5x8z -oyaml | grep topologySpreadConstraints
no result
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-11-055332
How reproducible:
always
Steps to Reproduce:
1. See the description.
Actual results:
topologySpreadConstraints settings are not loaded to UWM prometheus-operator pod and deployment
Expected results:
topologySpreadConstraints settings loaded to UWM prometheus-operator pod and deployment
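The expected propagation can be modelled in a few lines. This is only an illustrative sketch in Python, not the cluster-monitoring-operator code: manifests are plain dicts, and `apply_topology_spread` is a hypothetical helper that copies the constraints from the `prometheusOperator` section of the config into a deployment's pod template.

```python
import copy


def apply_topology_spread(deployment: dict, operator_config: dict) -> dict:
    """Copy topologySpreadConstraints from the operator config into the pod template.

    Returns the input unchanged when the config carries no constraints.
    """
    constraints = operator_config.get("topologySpreadConstraints")
    if not constraints:
        return deployment
    result = copy.deepcopy(deployment)  # do not mutate the caller's manifest
    pod_spec = (
        result.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    pod_spec["topologySpreadConstraints"] = copy.deepcopy(constraints)
    return result


# Values taken from the ConfigMap in this report.
uwm_operator_config = {
    "topologySpreadConstraints": [
        {
            "maxSkew": 1,
            "topologyKey": "node-role.kubernetes.io/master",
            "whenUnsatisfiable": "DoNotSchedule",
            "labelSelector": {
                "matchLabels": {"app.kubernetes.io/name": "prometheus-operator"}
            },
        }
    ]
}
```

With this model, the UWM deployment would carry the same `topologySpreadConstraints` block that the in-cluster deployment shows in the `grep` output above.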
Description of problem:
There is a new zone in PowerVS called dal12. We need to add this zone to the list of supported zones in the installer.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Deploy an OpenShift cluster to the dal12 zone.
Actual results:
Fails
Expected results:
Works
Additional info:
Description of problem:
If the authentication.config/cluster Type=="" but the OAuth/User APIs are already missing, the console-operator does not update the authentication.config/cluster status with its own client, because it crashes when it cannot retrieve OAuthClients.
Version-Release number of selected component (if applicable):
4.15.0
How reproducible:
100%
Steps to Reproduce:
1. Scale oauth-apiserver to 0.
2. Set featuregates to TechPreviewNotUpgradable.
3. Watch the authentication.config/cluster .status.oidcClients.
Actual results:
The client for the console does not appear.
Expected results:
The client for the console should appear.
Additional info:
Please review the following PR: https://github.com/openshift/multus-cni/pull/212
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In accounts with a large amount of resources, the destroy code will fail to list all resources. This has revealed some changes that need to be made to the destroy code to handle these situations.
Version-Release number of selected component (if applicable):
How reproducible:
Difficult - but we have an account where we can reproduce it consistently
Steps to Reproduce:
1. Try to destroy a cluster in an account with a large amount of resources.
2. The destroy fails.
Actual results:
Fail to destroy
Expected results:
Destroy succeeds
Additional info:
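Failing to list all resources in a large account is typical of listing code that reads only the first page of results. The sketch below shows exhaustive listing under stated assumptions: `fetch_page` is a hypothetical callable that returns one page of items plus a continuation token, and `fake_fetch` is an in-memory stand-in. Neither name comes from the installer or from any cloud SDK.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Page = Tuple[List[str], Optional[str]]


def list_all(
    fetch_page: Callable[[Optional[str]], Page], max_pages: int = 10_000
) -> Iterator[str]:
    """Yield every item by following continuation tokens until none is returned."""
    token: Optional[str] = None
    for _ in range(max_pages):
        items, token = fetch_page(token)
        yield from items
        if token is None:
            return
    # Guard against a backend that never stops handing out tokens.
    raise RuntimeError("pagination did not terminate")


# Hypothetical backend with five resources and a page size of two.
_RESOURCES = [f"res-{i}" for i in range(5)]


def fake_fetch(token: Optional[str]) -> Page:
    start = int(token) if token else 0
    end = start + 2
    next_token = str(end) if end < len(_RESOURCES) else None
    return _RESOURCES[start:end], next_token
```

A destroy loop built on `list_all` would see all five resources, whereas a single `fake_fetch(None)` call returns only the first two.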
This is a clone of issue OCPBUGS-30119. The following is the description of the original issue:
—
Description of problem:
`ensureSigningCertKeyPair` and `ensureTargetCertKeyPair` always update the secret type. If the secret requires a metadata update, its previous content is not retained.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Install a 4.6 cluster (or make sure installer-generated secrets have `type: SecretTypeTLS` instead of `type: kubernetes.io/tls`).
2. Run secret sync.
3. Check the secret contents.
Actual results:
Secret was regenerated with new content
Expected results:
Existing content should be preserved, content is not modified
Additional info:
This causes api-int CA update for clusters born in 4.6 or earlier.
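The expected behavior can be sketched as follows. This is an illustrative Python model, not the actual Go implementation: secrets are plain dicts, and `ensure_secret_type` is a hypothetical helper. The assumption is that correcting the type should carry the existing data over rather than regenerate it.

```python
import copy
from typing import Tuple

TLS_TYPE = "kubernetes.io/tls"


def ensure_secret_type(secret: dict) -> Tuple[dict, bool]:
    """Return (secret, changed); fix the type in a copy and keep existing data intact."""
    if secret.get("type") == TLS_TYPE:
        return secret, False
    updated = copy.deepcopy(secret)  # data and metadata are carried over as-is
    updated["type"] = TLS_TYPE
    return updated, True


# Hypothetical legacy secret as it might look on a cluster born in 4.6.
legacy = {
    "type": "SecretTypeTLS",
    "metadata": {"name": "example-signer"},
    "data": {"tls.crt": "CERT", "tls.key": "KEY"},
}
fixed, changed = ensure_secret_type(legacy)
```

Here `fixed` has the corrected type while `fixed["data"]` is identical to the original, which is the outcome the expected results ask for.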
Please review the following PR: https://github.com/openshift/cluster-csi-snapshot-controller-operator/pull/160
This is a clone of issue OCPBUGS-28579. The following is the description of the original issue:
—
Bring the downstream operator-controller repo up-to-date with the v0.7.0 upstream release.
Description of problem:
Original issue reported here: https://issues.redhat.com/browse/ACM-6189 reported by QE and customer.
Using ACM/Hive, customers can deploy OpenShift on vSphere. In the upcoming release of ACM 2.9, we support customers on OCP 4.12 - 4.15. The ACM UI updates the install config as users add configuration details.
This has worked for several releases over the last few years. However in OCP 4.13+ the format has changed and there is now additional validation to check if the datastore is a full path.
As per https://issues.redhat.com/browse/SPLAT-1093, removal of the legacy fields should not happen until later, so any legacy configurations such as relative paths should still work.
Version-Release number of selected component (if applicable):
ACM 2.9.0-DOWNSTREAM-2023-10-24-01-06-09 OpenShift 4.14.0-rc.7 OpenShift 4.13.18 OpenShift 4.12.39
How reproducible:
Always
Steps to Reproduce:
1. Deploy OCP 4.12 on vSphere using the legacy field and a relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS).
2. Installer passes.
3. Deploy OCP 4.12 on vSphere using the legacy field and a relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS).
4. Installer fails.
5. Deploy OCP 4.12 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS).
6. Installer fails.
7. Deploy OCP 4.13 on vSphere using the legacy field and a relative path without folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS).
8. Installer fails.
9. Deploy OCP 4.13 on vSphere using the legacy field and a relative path WITH folder (e.g. platform.vsphere.defaultDatastore: WORKLOAD-DS-Folder/WORKLOAD-DS).
10. Installer passes.
11. Deploy OCP 4.13 on vSphere using the legacy field and the FULL path (e.g. platform.vsphere.defaultDatastore: /Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS).
12. Installer fails.
Actual results:
Default Datastore Value | OCP 4.12 | OCP 4.13 | OCP 4.14 |
---|---|---|---|
/Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
WORKLOAD-DS-Folder/WORKLOAD-DS | No | Yes | Yes |
WORKLOAD-DS | Yes | No | No |
For OCP 4.12.z managed cluster deployments, the name-only path is the only one that works as expected.
For OCP 4.13.z+ managed cluster deployments, only the full path and the relative path with folder work as expected.
Expected results:
OCP 4.13.z+ takes relative path without specifying the folder like OCP 4.12.z does.
Additional info:
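One way to accept all three forms from the table is to normalize the value before validating it. The sketch below is an assumption about how such normalization could look, not the installer's implementation; `normalize_datastore` and its `datacenter` parameter are hypothetical.

```python
def normalize_datastore(value: str, datacenter: str) -> str:
    """Expand a name-only or folder-relative datastore value to a full inventory path."""
    if value.startswith("/"):
        return value  # already a full path, leave untouched
    return f"/{datacenter}/datastore/{value}"


# The three forms from the results table, all with the same datacenter.
examples = [
    "/Workload Datacenter/datastore/WORKLOAD-DS-Folder/WORKLOAD-DS",
    "WORKLOAD-DS-Folder/WORKLOAD-DS",
    "WORKLOAD-DS",
]
normalized = [normalize_datastore(v, "Workload Datacenter") for v in examples]
```

Note the limitation: a name-only value for a datastore that actually lives inside a folder expands to a path without that folder, so a real fix would still need an inventory lookup for that case.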
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Run oc -n openshift-machine-api get role/cluster-autoscaler-operator -o yaml
2. Observe the missing watch verb.
3. Tail the cluster-autoscaler logs to see the error:
status.go:444] No ClusterAutoscaler. Reporting available.
I0919 16:40:52.877216 1 status.go:244] Operator status available: at version 4.14.0-rc.1
E0919 16:40:53.719592 1 reflector.go:148] github.com/openshift/client-go/config/informers/externalversions/factory.go:101: Failed to watch *v1.ClusterOperator: unknown (get clusteroperators.config.openshift.io)
Actual results:
Expected results:
Additional info:
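The check in step 2 can be done mechanically. A minimal sketch, assuming the role has been exported with `-o json` and loaded into a dict; `role_allows` is a hypothetical helper, and `sample_role` is illustrative data, not the actual cluster-autoscaler-operator role.

```python
def role_allows(role: dict, api_group: str, resource: str, verb: str) -> bool:
    """Return True if any rule grants `verb` on `resource` in `api_group`."""
    for rule in role.get("rules", []):
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        verbs = rule.get("verbs", [])
        if (
            (api_group in groups or "*" in groups)
            and (resource in resources or "*" in resources)
            and (verb in verbs or "*" in verbs)
        ):
            return True
    return False


# Illustrative rules that grant get and list but not watch.
sample_role = {
    "rules": [
        {
            "apiGroups": ["config.openshift.io"],
            "resources": ["clusteroperators"],
            "verbs": ["get", "list"],
        }
    ]
}
```

For `sample_role`, asking for `watch` on `clusteroperators` returns `False`, which corresponds to the `Failed to watch *v1.ClusterOperator` error in the log.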
Description of problem:
After running ./openshift-install destroy cluster, the TagCategory still exists:
# ./openshift-install destroy cluster --dir cluster --log-level debug
DEBUG OpenShift Installer 4.15.0-0.nightly-2023-12-18-220750
DEBUG Built from commit 2b894776f1653ab818e368fa625019a6de82a8c7
DEBUG Power Off Virtual Machines
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-2
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-1
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-master-0
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
DEBUG Powered off VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Virtual Machines
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-rhcos-generated-region-generated-zone
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-2
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-1
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-master-0
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-kpg46
INFO Destroyed VirtualMachine=sgao-devqe-spn2w-worker-0-w5rrn
DEBUG Delete Folder
INFO Destroyed Folder=sgao-devqe-spn2w
DEBUG Delete StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
INFO Destroyed StoragePolicy=openshift-storage-policy-sgao-devqe-spn2w
DEBUG Delete Tag=sgao-devqe-spn2w
INFO Deleted Tag=sgao-devqe-spn2w
DEBUG Delete TagCategory=openshift-sgao-devqe-spn2w
INFO Deleted TagCategory=openshift-sgao-devqe-spn2w
DEBUG Purging asset "Metadata" from disk
DEBUG Purging asset "Master Ignition Customization Check" from disk
DEBUG Purging asset "Worker Ignition Customization Check" from disk
DEBUG Purging asset "Terraform Variables" from disk
DEBUG Purging asset "Kubeconfig Admin Client" from disk
DEBUG Purging asset "Kubeadmin Password" from disk
DEBUG Purging asset "Certificate (journal-gatewayd)" from disk
DEBUG Purging asset "Cluster" from disk
INFO Time elapsed: 29s
INFO Uninstallation complete!
# govc tags.category.ls | grep sgao
openshift-sgao-devqe-spn2w
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-12-18-220750
How reproducible:
always
Steps to Reproduce:
1. IPI install OCP on vSphere.
2. Destroy the installed cluster and check the TagCategory.
Actual results:
TagCategory still exists
Expected results:
TagCategory should be deleted
Additional info:
Also reproduced in openshift-install-linux-4.14.0-0.nightly-2023-12-20-184526 and 4.13.0-0.nightly-2023-12-21-194724, while 4.12.0-0.nightly-2023-12-21-162946 does not have this issue.
This is a clone of issue OCPBUGS-24587. The following is the description of the original issue:
—
After installing some operators, the ResolutionFailed condition shows up after some time:
$ kubectl get subscription.operators -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ResolutionFailed:.status.conditions[?(@.type=="ResolutionFailed")].status,MSG:.status.conditions[?(@.type=="ResolutionFailed")].message'
NAMESPACE   NAME   ResolutionFailed   MSG
infra-sso   rhbk-operator   True   [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
metallb-system   metallb-operator-sub   True   [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
multicluster-engine   multicluster-engine   True   [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
open-cluster-management   acm-operator-subscription   True   [failed to populate resolver cache from source redhat-marketplace/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.67.215:50051: connect: connection refused", failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused"]
openshift-cnv   kubevirt-hyperconverged   True   [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-gitops-operator   openshift-gitops-operator   True   [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-local-storage   local-storage-operator   True   [failed to populate resolver cache from source community-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.14.92:50051: connect: connection refused", failed to populate resolver cache from source certified-operators/openshift-marketplace: failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 172.30.202.255:50051: connect: connection refused"]
openshift-nmstate   kubernetes-nmstate-operator   <none>   <none>
openshift-operators   devworkspace-operator-fast-redhat-operators-openshift-marketplace   <none>   <none>
openshift-operators   external-secrets-operator   <none>   <none>
openshift-operators   web-terminal   <none>   <none>
openshift-storage   lvms   <none>   <none>
openshift-storage   mcg-operator-stable-4.14-redhat-operators-openshift-marketplace   <none>   <none>
openshift-storage   ocs-operator-stable-4.14-redhat-operators-openshift-marketplace   <none>   <none>
openshift-storage   odf-csi-addons-operator-stable-4.14-redhat-operators-openshift-marketplace   <none>   <none>
openshift-storage   odf-operator   <none>   <none>
The package server logs show that the catalog source was unavailable at one point. After a while the catalog source becomes available again, but the error does not disappear from the subscription.
Package server logs:
time="2023-12-05T14:27:09Z" level=warning msg="error getting bundle stream" action="refresh cache" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.30.37.69:50051: connect: connection refused\"" source="{redhat-operators openshift-marketplace}"
time="2023-12-05T14:27:09Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:28:26Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:30:23Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:35:56Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:37:28Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:39:40Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-marketplace openshift-marketplace}" action="sync catalogsource" address="redhat-marketplace.openshift-marketplace.svc:50051" name=redhat-marketplace namespace=openshift-marketplace
time="2023-12-05T14:46:07Z" level=info msg="updating PackageManifest based on CatalogSource changes: {certified-operators openshift-marketplace}" action="sync catalogsource" address="certified-operators.openshift-marketplace.svc:50051" name=certified-operators namespace=openshift-marketplace
time="2023-12-05T14:47:37Z" level=info msg="updating PackageManifest based on CatalogSource changes: {redhat-operators openshift-marketplace}" action="sync catalogsource" address="redhat-operators.openshift-marketplace.svc:50051" name=redhat-operators namespace=openshift-marketplace
time="2023-12-05T14:48:21Z" level=info msg="updating PackageManifest based on CatalogSource changes: {community-operators openshift-marketplace}" action="sync catalogsource" address="community-operators.openshift-marketplace.svc:50051" name=community-operators namespace=openshift-marketplace
time="2023-12-05T14:49:53Z" level=info msg="updating
Version-Release number of selected component (if applicable):
4.14.3
Steps to Reproduce:
1. Install an operator, for example metallb.
2. Wait until the catalog pod is unavailable at some point.
3. ResolutionFailed no longer disappears.
Actual results:
ResolutionFailed no longer disappears from the subscription.
Expected results:
ResolutionFailed disappears from the subscription.
Description of the problem:
After installing a VSphere platform spoke from the infrastructure operator, deleting and re-creating the agentserviceconfig results in the assisted-service pod continually crashing and being unable to recover
How reproducible:
100%
Steps to reproduce:
1. Install a spoke cluster with platformType: VSphere
2. Delete and re-create the agentserviceconfig
Actual results:
The assisted-service pod panics due to accessing a nil pointer
Expected results:
The assisted-service pod starts correctly and the vsphere cluster can continue to be managed
Workaround:
Delete all of the cluster resources related to the VSphere spoke cluster
Please review the following PR: https://github.com/openshift/machine-api-provider-azure/pull/90
[Jira:"Test Framework"] monitor test azure-metrics-collector collection failure in https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/28395/pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd/1724427658311241728
Looks like Azure is throttling our request. We should probably try some retry mechanism.
Relevant thread: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1699977299650309
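A retry mechanism of the kind suggested above could use exponential backoff with jitter. The following is a generic sketch, not the origin monitor-test code: `ThrottledError` is a hypothetical exception standing in for a throttling response, and all parameter names and defaults are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical error raised when the remote API answers with a throttling response."""


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on ThrottledError with exponentially growing, jittered delays."""
    for attempt in range(attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the throttling error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from concurrent clients.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can run without waiting. If the service returns a suggested wait time with its throttling response, honoring that value would be preferable to a computed delay.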
Description of problem:
A cluster installed with baselineCapabilitySet: None has the build resource available even though the Build capability is disabled:
❯ oc get -o json clusterversion version | jq '.spec.capabilities'
{
  "baselineCapabilitySet": "None"
}
❯ oc get -o json clusterversion version | jq '.status.capabilities.enabledCapabilities'
null
❯ oc get build -A
NAME      AGE
cluster   5h23m
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-04-143709
How reproducible:
100%
Steps to Reproduce:
1. Install a cluster with baselineCapabilitySet: None
Actual results:
❯ oc get build -A
NAME      AGE
cluster   5h23m
Expected results:
❯ oc get -A build
error: the server doesn't have a resource type "build"
slack thread with more info: https://redhat-internal.slack.com/archives/CF8SMALS1/p1696527133380269
Please review the following PR: https://github.com/openshift/cloud-provider-azure/pull/86
This is a clone of issue OCPBUGS-26952. The following is the description of the original issue:
—
Description of problem:
After the north-south (NS) MachineConfigs are deleted while ipsecConfig mode is `Full`, there is no IPsec on the cluster at all. Observed on a cluster upgraded from 4.14 to a 4.15 build.
Version-Release number of selected component (if applicable):
bot build on https://github.com/openshift/cluster-network-operator/pull/2191
How reproducible:
Always
Steps to Reproduce:
1. Start with an EW+NS cluster (4.14) and upgrade it to the bot build above to check the ipsecConfig modes.
2. Change ipsecConfig mode to Full.
3. Delete the NS MCs.
4. New MCs are spawned as `80-ipsec-master-extensions` and `80-ipsec-worker-extensions`.
5. The cluster settles with no IPsec at all (no ovn-ipsec-host ds).
6. The mode is still Full.
Actual results:
Mode Full actually replicates the Disabled state after the above steps.
Expected results:
Only NS IPsec should have gone away. EW IPsec should have persisted.
Additional info:
Description of problem:
AdmissionWebhookMatchConditions are enabled by default in Kubernetes 1.28, but we are currently disabling the feature gate in openshift/api. As a result, e2e tests are failing with Kubernetes 1.28 bump: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1646/pull-ci-openshift-kubernetes-master-e2e-aws-ovn-fips/1684354421837795328
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. 2. 3.
Actual results:
AdmissionWebhookMatchConditions tests are failing
Expected results:
AdmissionWebhookMatchConditions should pass
Additional info:
Let me know once this is fixed so that we can drop the commit that skips these tests.
Description of problem:
In the quick search, if you search for the word "net", you see two options with the same name and description. One is the source-to-image option and the other is the sample option, but there is no way to tell them apart in quick search.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Go to the Topology or Add page and select quick search.
2. Search for "net" or "node"; you will see confusing options.
Actual results:
Similar options with no differentiation in the quick search menu
Expected results:
Some way to differentiate different options in the quick search menu
Additional info:
Description of problem:
Impossible to create NFV workers
Version-Release number of selected component (if applicable):
4.15 (current master)
Actual results:
I1024 02:36:28.388445 1 controller.go:156] sj6vp0y3-56ae0-2f4wl-worker-0-ph4nw: reconciling Machine
I1024 02:36:29.068382 1 controller.go:349] sj6vp0y3-56ae0-2f4wl-worker-0-ph4nw: reconciling machine triggers idempotent create
I1024 02:36:31.426442 1 controller.go:115] "msg"="Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference" "controller"="machine-controller" "name"="sj6vp0y3-56ae0-2f4wl-worker-0-ph4nw" "namespace"="openshift-machine-api" "object"={"name":"sj6vp0y3-56ae0-2f4wl-worker-0-ph4nw","namespace":"openshift-machine-api"} "reconcileID"="1041b0ba-067a-4e94-8a2a-f71f46821275"
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x27c49ff]
goroutine 247 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:116 +0x1fa
panic({0x2a72f60, 0x430deb0})
    /usr/lib/golang/src/runtime/panic.go:884 +0x213
github.com/openshift/machine-api-provider-openstack/pkg/machine.MachineToInstanceSpec(0xc0006698c0, {0xc000a49940, 0x1, 0x4}, {0xc000a49980, 0x1, 0x4}, {0xc00029aa00, 0x6a6}, {0x30ab820, ...}, ...)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/convert.go:317 +0xb9f
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).convertMachineToCapoInstanceSpec(0xc0000f11f0, {0x30cb3b0, 0xc000c50b80}, 0xc0006698c0)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:157 +0x23b
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).createInstance(0xc0000f11f0, {0xc000c50b80?, 0xc00072a1b0?}, 0xc0006698c0, {0x30cb3b0, 0xc000c50b80})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:246 +0x137
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).reconcile(0xc0000f11f0, {0x30c5530, 0xc00072a1b0}, 0xc0006698c0)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:201 +0x23e
github.com/openshift/machine-api-provider-openstack/pkg/machine.(*OpenstackClient).Create(0xc000a42150?, {0x30c5530?, 0xc00072a1b0?}, 0x0?)
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/pkg/machine/actuator.go:172 +0x25
github.com/openshift/machine-api-operator/pkg/controller/machine.(*ReconcileMachine).Reconcile(0xc0002ab6d0, {0x30c5530, 0xc00072a1b0}, {{{0xc000c90a50?, 0x0?}, {0xc0000014a0?, 0xc00087bd48?}}})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/github.com/openshift/machine-api-operator/pkg/controller/machine/controller.go:350 +0xbb8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x30c9578?, {0x30c5530?, 0xc00072a1b0?}, {{{0xc000c90a50?, 0xb?}, {0xc0000014a0?, 0x0?}}})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119 +0xc8
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc000322a00, {0x30c5488, 0xc00028e2d0}, {0x2b57480?, 0xc0000e64c0?})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316 +0x3ca
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc000322a00, {0x30c5488, 0xc00028e2d0})
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
    /go/src/sigs.k8s.io/cluster-api-provider-openstack/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:223 +0x587
Expected results:
It should work
I think this is related to https://github.com/openshift/machine-api-provider-openstack/pull/87
Description of problem:
A new deployment of BM IPI using a provisioning network with IPv6 shows the following error: http://XXXX:XXXX:XXXX:XXXX::X:6180/images/ironic-python-agent.kernel.... connection timed out (http://ipxe.org/4c0a6092)
Version-Release number of selected component (if applicable):
OpenShift 4.12.32. Also seen in OpenShift 4.14.0-rc.5 when adding new nodes.
How reproducible:
Very frequent
Steps to Reproduce:
1. Deploy a cluster using BM with the provided config.
Actual results:
Consistent failures, depending on the version of OCP used to deploy
Expected results:
No error, successful deployment
Additional info:
Things checked while the bootstrap host is active and the installation information is still valid (and failing):
- Tried downloading the "ironic-python-agent.kernel" file from different places (bootstrap, bastion hosts, another provisioned host), and in all cases it worked:
[core@control-1-ru2 ~]$ curl -6 -v -o ironic-python-agent.kernel http://[XXXX:XXXX:XXXX:XXXX::X]:80/images/ironic-python-agent.kernel
* Trying XXXX:XXXX:XXXX:XXXX::X...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Connected to XXXX:XXXX:XXXX:XXXX::X (xxxx:xxxx:xxxx:xxxx::x) port 80 (#0)
> GET /images/ironic-python-agent.kernel HTTP/1.1
> Host: [xxxx:xxxx:xxxx:xxxx::x]
> User-Agent: curl/7.61.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Fri, 27 Oct 2023 08:28:09 GMT
< Server: Apache
< Last-Modified: Thu, 26 Oct 2023 08:42:16 GMT
< ETag: "a29d70-6089a8c91c494"
< Accept-Ranges: bytes
< Content-Length: 10657136
<
{ [14084 bytes data]
100 10.1M  100 10.1M    0     0   597M      0 --:--:-- --:--:-- --:--:--  597M
* Connection #0 to host xxxx:xxxx:xxxx:xxxx::x left intact
This verifies some of the components, such as the network setup and the httpd service running on the ironic pods.
- Also gathered a listing of the contents of the ironic pod running in podman, especially the shared directory. The contents of /shared/html/inspector.ipxe seem correct compared to a working installation, and all files appear to be in place.
- Logs from the ironic container show the errors coming from the node being deployed. The curl request is also shown here for comparison:
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:19:55 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx::x - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 200 10657136 "-" "curl/7.61.1"
cxxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx - - [27/Oct/2023:08:20:23 +0000] "GET /images/ironic-python-agent.kernel HTTP/1.1" 400 226 "-" "iPXE/1.0.0+ (4bd064de)"
This seems to be an issue with iPXE and IPv6.
Description of problem:
I've noticed that 'agent-cluster-install.yaml' and 'journal.export' from the agent gather process contain passwords. It's important not to expose password information in any of these generated files.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Generate an agent ISO by utilising agent-config and install-config, including platform credentials
2. Boot the ISO that was created
3. Run the agent-gather command on the node 0 machine to generate files.
Actual results:
The 'agent-cluster-install.yaml' and 'journal.export' files contain password information.
Expected results:
Password should be redacted.
Additional info:
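As an illustration of the expected behaviour, the following is a minimal sketch of redacting password-like fields from YAML-style text before it is written to a gathered file. It is not the agent-gather implementation: the function name `redact` and the set of key names treated as sensitive (`password`, `secret`, `token`) are assumptions made for this example, and the line-based approach would not cover free-form journal entries.

```python
import re

# Assumption: keys containing these words hold sensitive values.
# The real files may use different field names.
SENSITIVE_LINE = re.compile(
    r"^(\s*[\w.-]*(?:password|secret|token)[\w.-]*\s*:\s*)(.+)$",
    re.IGNORECASE,
)


def redact(text):
    """Replace the value of every sensitive-looking 'key: value' line."""
    redacted_lines = []
    for line in text.splitlines():
        # Keep the key and indentation, replace only the value.
        redacted_lines.append(SENSITIVE_LINE.sub(r"\1<redacted>", line))
    return "\n".join(redacted_lines)
```

For example, `redact("username: admin\npassword: hunter2")` keeps the `username` line unchanged and rewrites the second line to `password: <redacted>`.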
This is a clone of issue OCPBUGS-24716. The following is the description of the original issue:
—
Currently the openshift-baremetal-install binary is dynamically linked to libvirt-client, meaning that it is only possible to run it on a RHEL system with libvirt installed.
A new version of the libvirt bindings, v1.8010.0, allows the library to be loaded only on demand, so that users who do not execute any libvirt code can run the rest of the installer without needing to install libvirt. (See this comment from Dan Berrangé.) In practice, the "rest of the installer" is everything except the baremetal destroy cluster command (which destroys the bootstrap storage pool - though only if the bootstrap itself has already been successfully destroyed - and has probably never been used by anybody ever). The Terraform providers all run in a separate binary.
There is also a pure-go libvirt library that can be used even within a statically-linked binary on any platform, even when interacting with libvirt. The libvirt terraform provider that does almost all of our interaction with libvirt already uses this library.
Description of problem:
[ibm-vpc-block-csi-driver] Mounting a volume restored from an xfs volume snapshot fails with "Filesystem has duplicate UUID"
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-26-132453
How reproducible:
Always
Steps to Reproduce:
1. Install an OpenShift cluster on ibmcloud;
2. Create a pvc with the ibm-vpc-block csi storageclass and one pod that consumes the pvc;
3. Write some data to the pod's volume and sync;
4. Create a volumesnapshot and wait for it to become ReadyToUse;
5. Create a pvc that restores the volumesnapshot and create one pod that consumes the restored pvc;
Actual results:
In step 5, the volume mount failed with:
07-27 21:36:08.572 Mounting command: mount
07-27 21:36:08.572 Mounting arguments: -t xfs -o defaults /dev/disk/by-id/virtio-0787-6ec22828-ec32-4 /var/lib/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/ecef50d905ba489935099cad29a3773220fec45334e7546951706454894073e7/globalmount
07-27 21:36:08.572 Output: mount: /var/lib/kubelet/plugins/kubernetes.io/csi/vpc.block.csi.ibm.io/ecef50d905ba489935099cad29a3773220fec45334e7546951706454894073e7/globalmount: wrong fs type, bad option, bad superblock on /dev/vde, missing codepage or helper program, or other error.
Check the dmesg ->
[14530.520622] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14531.119703] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14532.229388] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14534.348809] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14538.396705] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14546.472831] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14562.523028] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14594.636819] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14658.749442] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
[14780.863678] XFS (vde): Filesystem has duplicate UUID a758102c-fbdd-41ef-b4de-60a546bf554b - can't mount
Expected results:
In step 5, the restored volume should mount successfully and the pod should become Running
Additional info:
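This looks like a bug in the CSI driver: it mounts the xfs volume without `-o nouuid`, so a restored volume that carries the same filesystem UUID as its source cannot be mounted on the same node.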
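The missing option can be illustrated with a minimal sketch of how mount arguments might be assembled. This is not the ibm-vpc-block-csi-driver code; the helper names `mount_options` and `mount_command` are hypothetical, and the only behaviour taken from the report is that xfs volumes should be mounted with `nouuid`.

```python
def mount_options(fs_type, requested=()):
    """Return the mount options, adding 'nouuid' for xfs volumes."""
    opts = list(requested) or ["defaults"]
    # XFS refuses to mount a second filesystem with an already-mounted UUID,
    # which is exactly what a restored snapshot of a mounted volume looks like.
    if fs_type == "xfs" and "nouuid" not in opts:
        opts.append("nouuid")
    return opts


def mount_command(device, target, fs_type, requested=()):
    """Build the argv for a mount call (hypothetical helper)."""
    opts = mount_options(fs_type, requested)
    return ["mount", "-t", fs_type, "-o", ",".join(opts), device, target]
```

With this sketch, `mount_command("/dev/vde", "/mnt/restore", "xfs")` yields `-o defaults,nouuid`, while other filesystem types keep their options unchanged.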
The agent-tui interface for editing the network config for the Agent ISO at boot time only runs on the graphical console (tty1). It's difficult to run two copies, so this gives the most value for now when there is a graphical console available.
However, when the host has only a serial console, there are two consequences:
Both situations could be resolved by allowing agent-tui to run on the serial console instead of the graphical console when there is no graphical console.
This is a clone of issue OCPBUGS-26554. The following is the description of the original issue:
—
Chinese translation in topology was invalid, see https://github.com/openshift/console/pull/13458
This is a clone of issue OCPBUGS-29476. The following is the description of the original issue:
—
Description of problem:
Core CAPI CRDs are not deployed on unsupported platforms even when explicitly needed by other operators. An example of this is vSphere clusters: CAPI is not yet supported on vSphere, but the CAPI IPAM CRDs are needed by operators other than the usual consumers, cluster-capi-operator and the CAPI controllers.
Version-Release number of selected component (if applicable):
How reproducible:
Launch a techpreview cluster for an unsupported platform (e.g. vsphere/azure). Check that the Core CAPI CRDs are not present.
Steps to Reproduce:
$ oc get crds | grep cluster.x-k8s.io
Actual results:
Core CAPI CRDs are not present (only the metal ones)
Expected results:
Core CAPI CRDs should be present
Additional info:
Cluster configuration page fields are not visible.
Screenshot : https://drive.google.com/file/d/17TrZNE2dY-AH-vUwcsjvC4E8wxiyPb9n/view?usp=drive_link
To support external OIDC on hypershift, but not on self-managed, we need different schemas for the authentication CRD on a default-hypershift versus a default-self-managed. This requires us to change rendering so that it honors the clusterprofile.
Then we have to update the installer to match, then update hypershift, then update the manifests.
Description of problem:
We need to sync downstream (d/s) with upstream (u/s) multus to support exposing the MTU in the network-status annotation.
The PR was merged upstream: https://github.com/k8snetworkplumbingwg/multus-cni/pull/1250
Description of problem:
When running the command `oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9`, the response code is 200 OK at the beginning; the command then hangs for a while, after which the response code is 401.
Version-Release number of selected component (if applicable):
How reproducible:
sometimes
Steps to Reproduce:
Using the advanced cluster management package as an example. 1. oc-mirror list operators --catalog=registry.redhat.io/redhat/certified-operator-index:v4.12 -v 9
Actual results: After hanging for a while, the command gets a 401 response code. It seems that when oc-mirror retries after a timeout, it does not read the credentials again.
level=debug msg=fetch response received digest=sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=714959 response.header.connection=keep-alive response.header.content-length=80847073 response.header.content-type=binary/octet-stream response.header.date=Mon, 06 Feb 2023 06:52:06 GMT response.header.etag="a428fafd37ee58f4bdeae1a7ff7235b5-1" response.header.last-modified=Fri, 16 Sep 2022 17:54:09 GMT response.header.server=AmazonS3 response.header.via=1.1 010c0731b9775a983eceaec0f5fa6a2e.cloudfront.net (CloudFront) response.header.x-amz-cf-id=rEfKWnJdasWIKnjWhYyqFn9eHY8v_3Y9WwSRnnkMTkPayHlBxWX1EQ== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=GfqTTjWbdqB0sreyjv3fyo1k6LQ9kZKC response.header.x-cache=Hit from cloudfront response.status=200 OK size=80847073 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:a67257cfe913ad09242bf98c44f2330ec7e8261ca3a8db3431cb88158c3d4837 level=debug msg=fetch response received digest=sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=595868 response.header.connection=keep-alive response.header.content-length=98028196 response.header.content-type=binary/octet-stream response.header.date=Tue, 07 Feb 2023 15:56:56 GMT response.header.etag="f702c84459b479088565e4048a890617-1" response.header.last-modified=Wed, 18 Jan 2023 06:55:12 GMT response.header.server=AmazonS3 response.header.via=1.1 7f5e0d3b9ea85d0d75063a66c0ebc840.cloudfront.net (CloudFront) response.header.x-amz-cf-id=Tw9cjJjYCy8idBiQ1PvljDkhAoEDEzuDCNnX6xJub4hGeh8V0CIP_A== 
response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=nt7yY.YmjWF0pfAhzh_fH2xI_563GnPz response.header.x-cache=Hit from cloudfront response.status=200 OK size=98028196 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:d242c7b4380d3c9db3ac75680c35f5c23639a388ad9313f263d13af39a9c8b8b level=debug msg=fetch response received digest=sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18 mediatype=application/vnd.docker.container.image.v1+json response.header.accept-ranges=bytes response.header.age=17430 response.header.connection=keep-alive response.header.content-length=24828 response.header.content-type=binary/octet-stream response.header.date=Tue, 14 Feb 2023 08:37:35 GMT response.header.etag="57eb6fdca8ce82a837bdc2cebadc3c7b-1" response.header.last-modified=Mon, 13 Feb 2023 16:11:57 GMT response.header.server=AmazonS3 response.header.via=1.1 0c96ded7ff282d2dbcf47c918b6bb500.cloudfront.net (CloudFront) response.header.x-amz-cf-id=w9zLDWvPJ__xbTpI8ba5r9DRsFXbvZ9rSx5iksG7lFAjWIthuokOsA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-version-id=Enw8mLebn4.ShSajtLqdo4riTDHnVEFZ response.header.x-cache=Hit from cloudfront response.status=200 OK size=24828 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:664a8226a152ea0f1078a417f2ec72d3a8f9971e8a374859b486b60049af9f18 level=debug msg=fetch response received digest=sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.accept-ranges=bytes response.header.age=829779 response.header.connection=keep-alive response.header.content-length=26039246 
response.header.content-type=binary/octet-stream response.header.date=Sat, 04 Feb 2023 22:58:25 GMT response.header.etag="a08688b701b31515c6861c69e4d87ebd-1" response.header.last-modified=Tue, 06 Dec 2022 20:50:51 GMT response.header.server=AmazonS3 response.header.via=1.1 000f4a2f631bace380a0afa747a82482.cloudfront.net (CloudFront) response.header.x-amz-cf-id=S-h31zheAEOhOs6uH52Rpq0ZnoRRdd5VfaqVbZWXzAX-Zym-0XtuKA== response.header.x-amz-cf-pop=HIO50-C1 response.header.x-amz-replication-status=COMPLETED response.header.x-amz-server-side-encryption=AES256 response.header.x-amz-storage-class=INTELLIGENT_TIERING response.header.x-amz-version-id=BQOjon.COXTTON_j20wZbWWoDEmGy1__ response.header.x-cache=Hit from cloudfront response.status=200 OK size=26039246 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:130c9d0ca92e54f59b68c4debc5b463674ff9555be1f319f81ca2f23e22de16f level=debug msg=do request digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip request.header.accept=application/vnd.docker.image.rootfs.diff.tar.gzip, */* request.header.range=bytes=13417268- request.header.user-agent=opm/alpha request.method=GET size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 level=debug msg=fetch response received digest=sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9 mediatype=application/vnd.docker.image.rootfs.diff.tar.gzip response.header.cache-control=max-age=0, no-cache, no-store response.header.connection=keep-alive response.header.content-length=99 response.header.content-type=application/json response.header.date=Tue, 14 Feb 2023 13:34:06 GMT response.header.docker-distribution-api-version=registry/2.0 response.header.expires=Tue, 14 Feb 2023 13:34:06 GMT response.header.pragma=no-cache 
response.header.registry-proxy-request-id=0d7ea55f-e96d-4311-885a-125b32c8e965 response.header.www-authenticate=Bearer realm="https://registry.redhat.io/auth/realms/rhcc/protocol/redhat-docker-v2/auth",service="docker-registry",scope="repository:redhat/certified-operator-index:pull" response.status=401 Unauthorized size=91700480 url=https://registry.redhat.io/v2/redhat/certified-operator-index/blobs/sha256:db8e9d2f583af66157f383f9ec3628b05fa0adb0d837269bc9f89332c65939b9.
Expected results:
The command should always read the credentials, including on retries.
Description of the problem:
The Agent CR can reference a Secret containing a token for pulling ignition. This is generally used by HyperShift. The agent controller takes the token from the referenced Secret and applies it to the host in the DB. However, if the token is rotated, the agent controller doesn't notice this, and the agent continues to pull ignition with the old token, which obviously fails. The agent controller must watch these Secrets so that it will reconcile when the Secret is updated.
How reproducible:
100%
Steps to reproduce:
1. Create a hosted cluster and another host to be added
2. Wait for the token to be rotated in the Secret
3. Notice that the agent is still pulling with the old token
Actual results:
The agent is still pulling with the old token
Expected results:
The agent pulls ignition with the new token
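The required behaviour (reconcile whenever a referenced Secret changes) can be sketched as a small in-memory model. This is only an illustration, not the agent controller or any Kubernetes client API: the class `AgentTokenSync` and its methods are hypothetical names, and the dictionary `host_tokens` stands in for the host records in the DB.

```python
class AgentTokenSync:
    """Toy model: map Secrets to the agents that reference them."""

    def __init__(self):
        self.secret_to_agents = {}  # secret name -> set of agent names
        self.host_tokens = {}       # agent name -> token stored for the host

    def track(self, agent, secret_name, token):
        """Record that an agent references a Secret and store its token."""
        self.secret_to_agents.setdefault(secret_name, set()).add(agent)
        self.host_tokens[agent] = token

    def on_secret_updated(self, secret_name, token):
        """Called for every Secret update event; returns the agents refreshed."""
        refreshed = []
        for agent in sorted(self.secret_to_agents.get(secret_name, ())):
            if self.host_tokens.get(agent) != token:
                # Without this step the host keeps pulling with the old token.
                self.host_tokens[agent] = token
                refreshed.append(agent)
        return refreshed
```

The point of the sketch is the reverse lookup from Secret to agents: a token rotation then triggers an update of every host that references the Secret, instead of waiting for an unrelated change to the Agent CR.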
Description of problem:
Running the command `oc adm ocp-certificates monitor-certificates` panics.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. `oc adm ocp-certificates monitor-certificates`
Actual results:
panic:
Expected results:
no panic
Additional info:
Please review the following PR: https://github.com/openshift/ovn-kubernetes/pull/1884
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
The issue:
An interesting issue came up on #forum-ui-extensibility. There was an attempt to use extensions to nest a details page under a details page that contained a horizontal nav. This caused an issue with rendering the page content when a sub link was clicked – which caused confusion.
The why:
The reason this happened was the resource details page had a tab that contained a resource list page. This resource list page showed a number of items of CRs that when clicked would try to append their name onto the URL. This confused the navigation, thinking that this path must be another tab, so no tabs were selected and no content was visible. The goal was to reuse this longer path name as a details page of its own with its own horizontal nav. This issue is a conceptual misunderstanding of the way our list & details pages work in OpenShift Console.
List Pages are sometimes found via direct navigation links. List pages are almost all shown on the Search page, allowing a user to navigate to both existing nav items and other non-primary resources.
Details Pages are individual items found in the List Pages (a row). These are stand alone pages that show details of a singular CR and optionally can have tabs that list other resources – but they always transition to a fresh Details page instead of compounding on the currently visible one.
The ask:
If we could document this in a fashion that can help Plugin developers share the same UX that the rest of the Console does then we will have a more unified approach to UX within the Console and through any installed Plugins.
Description of problem:
After the PF5 upgrade, older components using PF4 dropdown menus had list style bullets appear for unordered lists
Version-Release number of selected component (if applicable):
How reproducible:
Metrics Plugin still uses PF4 components and styling
Additional info:
PatternFly removes list-style bullets or numbers from the <ul>/<ol> elements by default and then adds them where needed. The OCP console chose to override this because of the number of <ul>/<ol> elements in our codebase that expect the default bullets or numbers to be present.
Bug screenshots
https://drive.google.com/drive/folders/1rP6Ls1R2GJoTArHg0oild5SWIWvNaMUv
Please review the following PR: https://github.com/openshift/csi-external-snapshotter/pull/134
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Please review the following PR: https://github.com/openshift/k8s-prometheus-adapter/pull/74
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Worker CSRs are pending, so no worker nodes are available
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-06-234925
How reproducible:
Always
Steps to Reproduce:
Create a cluster with profile - aws-c2s-ipi-disconnected-private-fips
Actual results:
Worker CSRs are pending
Expected results:
Workers should be up and running, with all CSRs approved
Additional info:
“failed to find machine for node ip-10-143-1-120” appears in the logs of cluster-machine-approver. It seems we should have names like “ip-10-143-1-120.ec2.internal”. The check fails here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L263
Must-gather - https://drive.google.com/file/d/15tz9TLdTXrH6bSBSfhlIJ1l_nzeFE1R3/view?usp=sharing
template for installation - https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-fips-c2s-ci
Please review the following PR: https://github.com/openshift/cluster-storage-operator/pull/424
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
In the newly added Revisions and Routes tabs on the service details page, the details of other services are also displayed. The tabs should filter for the particular service.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Always
Steps to Reproduce:
1. Install the serverless operator
2. Create a serving instance
3. Create multiple services in a namespace
4. Click on any service and go to the Revisions, Routes and Pods pages
Actual results:
Revisions and routes from other services are also displayed
Expected results:
Revisions and routes for that particular service should be displayed
Additional info:
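The expected filtering can be sketched as follows. This is an illustration in Python rather than the console's code; it assumes the resources are plain dictionaries shaped like Kubernetes objects and that Revisions and Routes carry the `serving.knative.dev/service` label naming their owning service. The function name `resources_for_service` is hypothetical.

```python
# Assumption: Knative sets this label on Revisions and Routes of a Service.
SERVICE_LABEL = "serving.knative.dev/service"


def resources_for_service(resources, service_name, namespace):
    """Keep only the resources that belong to the given service."""
    matching = []
    for resource in resources:
        metadata = resource.get("metadata", {})
        labels = metadata.get("labels", {})
        if (
            metadata.get("namespace") == namespace
            and labels.get(SERVICE_LABEL) == service_name
        ):
            matching.append(resource)
    return matching
```

Applied to the list backing each tab, this returns only the Revisions or Routes of the service whose details page is open, rather than every item in the namespace.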
After NetworkManager (NM) introduced the dns-change event, we are creating an infinite loop of on-prem-resolv-prepender.service runs. This is because our prepender script ALWAYS runs `nmcli general reload dns-rc`, regardless of whether any change is actually needed.
Because of this, we have the following sequence:
1) NM changes DNS
2) the dispatcher script appends a server to /etc/resolv.conf
3) the dispatcher is invoked again by a new dns-change event
4) the dispatcher checks again and creates a new /etc/resolv.conf, identical to the old one
5) NM changes DNS, and the dns-change event is invoked
6) goto 3
As a fix, the prepender script should check whether the newly generated file differs from the existing /etc/resolv.conf and only apply the change if needed.
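The proposed check can be sketched as a compare-before-write helper. The actual prepender is a shell dispatcher script, so this Python version is only an illustration of the logic; the function name `update_if_changed` and the `on_change` callback (which stands in for running `nmcli general reload dns-rc`) are assumptions of this example.

```python
import os


def update_if_changed(path, new_content, on_change=None):
    """Write new_content to path only if it differs; return True if written."""
    try:
        with open(path, "r", encoding="utf-8") as existing:
            current = existing.read()
    except FileNotFoundError:
        current = None

    if current == new_content:
        # Nothing changed: skip the write and the reload, which breaks the loop.
        return False

    # Write to a temporary file first so readers never see a partial file.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as tmp:
        tmp.write(new_content)
    os.replace(tmp_path, path)

    if on_change is not None:
        on_change()
    return True
```

Because the reload callback only runs when the content actually changes, the second dispatcher invocation in the sequence above becomes a no-op and no further dns-change event is produced.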
When the ingress operator's clientca-configmap controller reconciles an IngressController, this controller attempts to add a finalizer to the IngressController if that finalizer is absent. This controller erroneously attempts to add the missing finalizer even if the IngressController is marked for deletion, which results in an error. This error causes the controller to retry the deletion and log the error multiple times.
I observed this in CI for OCP 4.14 and was able to reproduce it on 4.11.37, and it probably affects earlier versions as well. The problematic code was added in https://github.com/openshift/cluster-ingress-operator/pull/450/commits/0f36470250c3089769867ebd72e25c413a29cda2 in OCP 4.9 to implement NE-323.
Easily.
1. Create a configmap in the "openshift-config" namespace (to reproduce this issue, it is not necessary that the configmap have a valid TLS certificate and key):
oc -n openshift-config create configmap client-ca-cert
2. Create an IngressController that specifies spec.clientTLS.clientCA.name to point to the configmap from the previous step:
oc create -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: test-client-ca-configmap
  namespace: openshift-ingress-operator
spec:
  domain: example.xyz
  endpointPublishingStrategy:
    type: Private
  clientTLS:
    clientCA:
      name: client-ca-cert
    clientCertificatePolicy: Required
EOF
3. Delete the IngressController:
oc -n openshift-ingress-operator delete ingresscontrollers/test-client-ca-configmap
4. Check the ingress operator's logs:
oc -n openshift-ingress-operator logs -c ingress-operator deployments/ingress-operator
The ingress operator logs several attempts to add the finalizer to the IngressController after it has been marked for deletion:
2023-06-15T02:17:12.419Z ERROR operator.init controller/controller.go:273 Reconciler error {"controller": "clientca_configmap_controller", "object": {"name":"test-client-ca-configmap","namespace":"openshift-ingress-operator"}, "namespace": "openshift-ingress-operator", "name": "test-client-ca-configmap", "reconcileID": "2274f55e-e5bd-4fdb-973e-821a44cf2ebf", "error": "failed to add client-ca-configmap finalizer: IngressController.operator.openshift.io \"test-client-ca-configmap\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"ingresscontroller.operator.openshift.io/finalizer-clientca-configmap\"}"}
The deletion does succeed, errors notwithstanding.
The ingress operator should succeed in deleting the IngressController without attempting to re-add the finalizer to the IngressController after it has been marked for deletion.
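The expected guard can be sketched as follows. This is an illustration in Python on a dictionary shaped like a Kubernetes object, not the operator's Go code; the function name `ensure_finalizer` is hypothetical, while the finalizer string is the one shown in the log message above.

```python
FINALIZER = "ingresscontroller.operator.openshift.io/finalizer-clientca-configmap"


def ensure_finalizer(obj):
    """Add the finalizer unless the object is being deleted; return True if added."""
    metadata = obj.setdefault("metadata", {})

    # An object marked for deletion has deletionTimestamp set; the API server
    # rejects any attempt to add new finalizers to it.
    if metadata.get("deletionTimestamp"):
        return False

    finalizers = metadata.setdefault("finalizers", [])
    if FINALIZER in finalizers:
        return False

    finalizers.append(FINALIZER)
    return True
```

Checking the deletion timestamp before touching the finalizer list avoids the "no new finalizers can be added if the object is being deleted" error and the repeated reconcile attempts that follow it.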
Please review the following PR: https://github.com/openshift/kubernetes-autoscaler/pull/270
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
A customer (CU) deployed the OCP 4.12 RC dev-preview release using the Agent-based installer. During installation, it was observed that the [<install-dir>/.openshift_install.log] file only contains the logs of the openshift-install agent create image command; the logs of [openshift-install agent wait-for] are missing. With the previous release on IPI/UPI, however, the [wait-for] logs were available in the [.openshift_install.log] file. The customer wants to understand whether the functionality of openshift-install changed with the agent command, and whether these logs can be made available.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Install OCP 4.12 dev-release using the Agent-based installer.
2. Observe that only the logs of [agent iso create] are visible, not those of the "wait-for" command.
Actual results:
The agent wait-for logs are missing
Expected results:
openshift-install agent wait-for install-complete should log to the .openshift_install.log file
Additional info:
Description of problem:
Attempted upgrade of 3480 SNOs that were deployed from 4.13.11 to 4.14.0-rc.0 and 15 SNOs ended up stuck in partial upgrade because the cluster console operator was not available
# cat 4.14.0-rc.0-partial.console | xargs -I % sh -c "echo -n '% '; oc --kubeconfig /root/hv-vm/kc/%/kubeconfig get clusterversion --no-headers"
vm00255 version 4.13.11 True True 21h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm00320 version 4.13.11 True True 21h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm00327 version 4.13.11 True True 21h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm00405 version 4.13.11 True True 21h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm00705 version 4.13.11 True True 21h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm01224 version 4.13.11 True True 19h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm01310 version 4.13.11 True True 19h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm01320 version 4.13.11 True True 19h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm01928 version 4.13.11 True True 19h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm02052 version 4.13.11 True True 19h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm02588 version 4.13.11 True True 17h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm02704 version 4.13.11 True True 17h Unable to apply 4.14.0-rc.0: wait has exceeded 40 minutes for these operators: console
vm02835 version 4.13.11 True True 17h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm03110 version 4.13.11 True True 15h Unable to apply 4.14.0-rc.0: the cluster operator console is not available
vm03322 version 4.13.11 True True 15h Unable to apply 4.14.0-rc.0: wait has exceeded 40 minutes for these operators: console
Version-Release number of selected component (if applicable):
SNO OCP (managed clusters being upgraded) 4.13.11 upgraded to 4.14.0-rc.0 Hub OCP 4.13.12 ACM - 2.9.0-DOWNSTREAM-2023-09-07-04-47-52
How reproducible:
15 out of 3489 SNOs being upgraded; however, these represent 15 out of the 41 partial upgrade failures (~36% of the failures)
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/38
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-30102. The following is the description of the original issue:
—
Description of problem:
For high scalability, we need an option to disable unused machine management control plane components.
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Create HostedCluster/HostedControlPlane
Actual results:
Machine management components (cluster-api, machine-approver, auto-scaler, etc) are deployed
Expected results:
There should be an option to disable these components, as in some use cases they provide no utility.
Additional info:
Please review the following PR: https://github.com/openshift/cluster-kube-storage-version-migrator-operator/pull/94
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When trying to create sriov pods, pods are stuck in state ContainerCreating.
pod definition:
apiVersion: v1
kind: Pod
metadata:
  name: test-sriov-pod
  namespace: default
  annotations:
    v1.multus-cni.io/default-network: default/ftnetattach
  labels:
    pod-name: ft-iperf-server-pod-v4
spec:
  containers:
  - name: ft-iperf-server-pod-v4
    image: quay.io/wizhao/ft-base-image:0.8-x86_64
net-attach-def:
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    k8s.v1.cni.cncf.io/resourceName: openshift.io/mlxnics
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"k8s.cni.cncf.io/v1","kind":"NetworkAttachmentDefinition","metadata":{"annotations":{"k8s.v1.cni.cncf.io/resourceName":"openshift.io/mlxnics"},"name":"ftnetattach","namespace":"default"},"spec":{"config":"{\"cniVersion\":\"0.3.1\",\"name\":\"ftnetattach\",\"type\":\"ovn-k8s-cni-overlay\",\"logFile\":\"/var/log/ovn-kubernetes/flowtest.log\",\"logLevel\":\"4\",\"ipam\":{},\"dns\":{}}"}}
  creationTimestamp: "2023-10-27T20:59:38Z"
  generation: 1
  name: ftnetattach
  namespace: default
  resourceVersion: "241792"
  uid: c394f8bc-20bc-4d0f-b5ce-9f5baad7c3de
spec:
  config: '{"cniVersion":"0.3.1","name":"ftnetattach","type":"ovn-k8s-cni-overlay","logFile":"/var/log/ovn-kubernetes/flowtest.log","logLevel":"4","ipam":{},"dns":{}}'
Bisecting to find when this error started occurring indicates that it was triggered by this change: https://github.com/ovn-org/ovn-kubernetes/pull/3958
Version-Release number of selected component (if applicable):
How reproducible:
Every time
Steps to Reproduce:
1. Deploy the SR-IOV network operator.
2. Apply the ovn-k8s-cni-overlay net-attach-def.
3. Create the pod.
Actual results:
[]# oc get pod test-sriov-pod NAME READY STATUS RESTARTS AGE test-sriov-pod 0/1 ContainerCreating 0 2d18h [] oc describe pod test-sriov-pod <....> Warning FailedCreatePodSandBox 36s (x18366 over 2d18h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_test-sriov-pod_default_12194f6e-96ea-4255-be89-a05c57e7d85b_0(cfd3586aa90898cb4197f9c659b80f9e50989fc847e7722a529d137d450a9feb): error adding pod default_test-sriov-pod to CNI network "multus-cni-network": plugin type="multus-shim" name="multus-cni-network" failed (add): CmdAdd (shim): CNI request failed with status 400: '&{ContainerID:cfd3586aa90898cb4197f9c659b80f9e50989fc847e7722a529d137d450a9feb Netns:/var/run/netns/58ad326c-68fe-487a-b449-ff1e0d9bbb64 IfName:eth0 Args:IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=test-sriov-pod;K8S_POD_INFRA_CONTAINER_ID=cfd3586aa90898cb4197f9c659b80f9e50989fc847e7722a529d137d450a9feb;K8S_POD_UID=12194f6e-96ea-4255-be89-a05c57e7d85b Path: StdinData:[123 34 98 105 110 68 105 114 34 58 34 47 118 97 114 47 108 105 98 47 99 110 105 47 98 105 110 34 44 34 99 104 114 111 111 116 68 105 114 34 58 34 47 104 111 115 116 114 111 111 116 34 44 34 99 108 117 115 116 101 114 78 101 116 119 111 114 107 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 47 49 48 45 111 118 110 45 107 117 98 101 114 110 101 116 101 115 46 99 111 110 102 34 44 34 99 110 105 67 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 101 116 99 47 99 110 105 47 110 101 116 46 100 34 44 34 99 110 105 86 101 114 115 105 111 110 34 58 34 48 46 51 46 49 34 44 34 100 97 101 109 111 110 83 111 99 107 101 116 68 105 114 34 58 34 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 103 108 111 98 97 108 78 97 109 101 115 112 97 99 101 115 34 58 34 100 101 102 97 117 108 116 44 111 112 101 110 115 104 105 102 116 45 109 117 108 116 117 115 
44 111 112 101 110 115 104 105 102 116 45 115 114 105 111 118 45 110 101 116 119 111 114 107 45 111 112 101 114 97 116 111 114 34 44 34 108 111 103 76 101 118 101 108 34 58 34 118 101 114 98 111 115 101 34 44 34 108 111 103 84 111 83 116 100 101 114 114 34 58 116 114 117 101 44 34 109 117 108 116 117 115 65 117 116 111 99 111 110 102 105 103 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 99 110 105 47 110 101 116 46 100 34 44 34 109 117 108 116 117 115 67 111 110 102 105 103 70 105 108 101 34 58 34 97 117 116 111 34 44 34 110 97 109 101 34 58 34 109 117 108 116 117 115 45 99 110 105 45 110 101 116 119 111 114 107 34 44 34 110 97 109 101 115 112 97 99 101 73 115 111 108 97 116 105 111 110 34 58 116 114 117 101 44 34 112 101 114 78 111 100 101 67 101 114 116 105 102 105 99 97 116 101 34 58 123 34 98 111 111 116 115 116 114 97 112 75 117 98 101 99 111 110 102 105 103 34 58 34 47 118 97 114 47 108 105 98 47 107 117 98 101 108 101 116 47 107 117 98 101 99 111 110 102 105 103 34 44 34 99 101 114 116 68 105 114 34 58 34 47 101 116 99 47 99 110 105 47 109 117 108 116 117 115 47 99 101 114 116 115 34 44 34 99 101 114 116 68 117 114 97 116 105 111 110 34 58 34 50 52 104 34 44 34 101 110 97 98 108 101 100 34 58 116 114 117 101 125 44 34 115 111 99 107 101 116 68 105 114 34 58 34 47 104 111 115 116 47 114 117 110 47 109 117 108 116 117 115 47 115 111 99 107 101 116 34 44 34 116 121 112 101 34 58 34 109 117 108 116 117 115 45 115 104 105 109 34 125]} ContainerID:"cfd3586aa90898cb4197f9c659b80f9e50989fc847e7722a529d137d450a9feb" Netns:"/var/run/netns/58ad326c-68fe-487a-b449-ff1e0d9bbb64" IfName:"eth0" Args:"IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=test-sriov-pod;K8S_POD_INFRA_CONTAINER_ID=cfd3586aa90898cb4197f9c659b80f9e50989fc847e7722a529d137d450a9feb;K8S_POD_UID=12194f6e-96ea-4255-be89-a05c57e7d85b" Path:"" ERRORED: error configuring pod [default/test-sriov-pod] networking: 
[default/test-sriov-pod/12194f6e-96ea-4255-be89-a05c57e7d85b:ftnetattach]: error adding container to network "ftnetattach": failed to send CNI request: Post "http://dummy/": EOF
Expected results:
The pod is created and is allocated a device.
Additional info:
Red Hat CoreOS: 414.92.202310270216-0 Cluster version: 4.14.0-0.nightly-multi-2023-10-27-070855
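The StdinData field in the failure event above is printed as a Go byte slice (space-separated decimal values), which is hard to read. The following is a minimal sketch, not part of the original report, showing how such a value could be decoded back into text. The function name decode_go_byte_slice is hypothetical.

```python
def decode_go_byte_slice(text: str) -> str:
    """Decode a Go-style printed byte slice such as "[123 34 125]" into a UTF-8 string."""
    # Drop the surrounding brackets, then split on whitespace to get the decimal values.
    numbers = text.strip().lstrip("[").rstrip("]").split()
    return bytes(int(n) for n in numbers).decode("utf-8")


# First ten bytes of the StdinData value shown in the event above.
prefix = "[123 34 98 105 110 68 105 114 34 58]"
print(decode_go_byte_slice(prefix))  # prints {"binDir":
```

Pasting the full array from the event into the function should yield the multus-shim JSON configuration that was sent with the CNI request.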
This is a clone of issue OCPBUGS-28548. The following is the description of the original issue:
—
Description of problem:
In https://github.com/openshift/release/pull/47648 ecr-credentials-provider is built in CI and later included in RHCOS. To make it work on OKD, it needs to be included in the payload so that the OKD machine-os can extract the RPM and install it on the host.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Ref: OCPBUGS-25662
Description of problem:
Deleting the node with the Ingress VIP using oc delete node causes a keepalived split-brain
Version-Release number of selected component (if applicable):
4.12, 4.14
How reproducible:
100%
Steps to Reproduce:
1. In an OpenShift cluster installed via vSphere IPI, check which node holds the Ingress VIP.
2. Delete the node.
3. Check the discrepancy between Machine objects and nodes. There will be more machines than nodes.
4. SSH to the deleted node and check that the VIP is still assigned and the keepalived pods are running.
5. Check that the VIP is also assigned on another worker.
6. SSH to the node and check that the VIP is still present.
Actual results:
The deleted node still has the VIP present and the ingress fails sometimes
Expected results:
The deleted node should not have the VIP present and the ingress should not fail.
Additional info:
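The check in steps 4 to 6 amounts to asking whether more than one host holds the VIP at the same time. The following is a minimal sketch of that check, not taken from the original report. It assumes the per-node address lists have already been collected (for example by running `ip -o addr show` on each host); the function names, node names, and addresses are hypothetical.

```python
from typing import Dict, List


def nodes_holding_vip(addresses_by_node: Dict[str, List[str]], vip: str) -> List[str]:
    """Return the sorted names of all nodes that currently have the VIP assigned."""
    return sorted(node for node, addrs in addresses_by_node.items() if vip in addrs)


def is_split_brain(addresses_by_node: Dict[str, List[str]], vip: str) -> bool:
    """A VIP present on more than one node at once indicates a keepalived split-brain."""
    return len(nodes_holding_vip(addresses_by_node, vip)) > 1


# Hypothetical data: the deleted node still holds the VIP alongside another worker.
observed = {
    "worker-deleted": ["192.0.2.11", "192.0.2.100"],
    "worker-1": ["192.0.2.12", "192.0.2.100"],
    "worker-2": ["192.0.2.13"],
}
print(nodes_holding_vip(observed, "192.0.2.100"))
print(is_split_brain(observed, "192.0.2.100"))
```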
Description of problem:
We have observed that when creating clusters through OCM using the Hive provisioner, which uses the OpenShift installer, some of the AWS IAM Instance Profiles are sometimes not cleaned up when their corresponding cluster is deleted.
Version-Release number of selected component (if applicable):
"time=\"2023-09-11T10:37:10Z\" level=debug msg=\"OpenShift Installer v4.12.0\""
How reproducible:
At the moment we have not found a way to reproduce it consistently, but it does not seem to be an isolated case, because AWS IAM Instance Profiles have accumulated in the AWS account that we use for our tests.
Actual results:
Sometimes some of the AWS IAM instance profiles associated with the deleted cluster are not cleaned up.
Expected results:
The AWS IAM instance profiles associated with the deleted cluster are also deleted.
Additional info:
In https://issues.redhat.com/browse/OCM-2748 we have been doing an investigation of accumulated AWS IAM Instance Profiles in one of our AWS accounts. If you are interested in full details of the investigation please take a look at the issue and its comments.
Focusing on the instance profiles associated with clusters that we create as part of our test suite, we see that the majority of them are worker instance profiles. We also see some occurrences of master and bootstrap instance profiles, but we concentrated on worker profiles because they are the vast majority of the accumulated ones.
For the purposes of the investigation we focused on a specific cluster 'cs-ci-2lmxd' and we have seen that the worker iam instance profile was created by the openshift installer:
time="2023-09-11T10:37:43Z" level=debug msg="module.iam.aws_iam_instance_profile.worker: Creation complete after 0s [id=cs-ci-2lmxd-9qtk4-worker-profile]"
But we found that when the cluster was deleted the openshift installer didn't delete it.
However, we could see that the master profile was created:
time="2023-09-11T10:37:43Z" level=debug msg="module.masters.aws_iam_instance_profile.master: Creation complete after 0s [id=cs-ci-2lmxd-9qtk4-master-profile]"
but in this case openshift installer deleted it properly when the cluster was deleted:
time="2023-09-11T10:49:58Z" level=info msg=Deleted InstanceProfileName=cs-ci-2lmxd-9qtk4-master-profile arn="arn:aws:iam::765374464689:instance-profile/cs-ci-2lmxd-9qtk4-master-profile" id=i-079f2d1580240e3cb resourceType=instance
As additional information, I can see that the worker profile has no tags:
msoriano@localhost:~/go/src/gitlab.cee.redhat.com/service/uhc-clusters-service (master)(ocm:S)$ aws iam list-instance-profile-tags --instance-profile-name=cs-ci-2lmxd-9qtk4-worker-profile
{
    "Tags": []
}
I attach the install and uninstall logs in this issue too.
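To find accumulated profiles like the one above, the profile names can be compared against the infrastructure IDs of clusters that still exist. The following is a minimal sketch, not part of the original investigation. It assumes the naming pattern visible in the logs (infra ID followed by "-worker-profile" or "-master-profile"); bootstrap profiles are omitted because their naming is not shown in this report. The function name and the second infra ID are hypothetical.

```python
from typing import Iterable, List, Set

# Suffixes taken from the profile names in the install log above.
PROFILE_SUFFIXES = ("-master-profile", "-worker-profile")


def leaked_instance_profiles(profile_names: Iterable[str], live_infra_ids: Set[str]) -> List[str]:
    """Return profile names whose infra ID does not belong to any live cluster."""
    leaked = []
    for name in profile_names:
        for suffix in PROFILE_SUFFIXES:
            if name.endswith(suffix):
                infra_id = name[: -len(suffix)]
                if infra_id not in live_infra_ids:
                    leaked.append(name)
                break  # a name matches at most one suffix
    return leaked


profiles = [
    "cs-ci-2lmxd-9qtk4-worker-profile",   # from the log above; its cluster was deleted
    "cs-ci-aaaaa-bbbbb-master-profile",   # hypothetical profile of a live cluster
    "unrelated-profile-name",             # ignored: does not follow the assumed pattern
]
print(leaked_instance_profiles(profiles, {"cs-ci-aaaaa-bbbbb"}))
```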
Please review the following PR: https://github.com/openshift/machine-api-provider-ibmcloud/pull/31
Tracker issue for bootimage bump in 4.16. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-29441.
Description of problem:
OCP 4.14 installation fails in AWS environments where S3 versioning is enforced. OCP 4.13 installs successfully in the same environment.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Always
Steps to Reproduce:
1. Use any native AWS mechanism to enforce versioning on S3. AWS Config is the easiest. This enables versioning on S3 buckets after creation.
2. Install OCP 4.13 on AWS using the defaults. It will succeed.
3. Install OCP 4.14 on AWS using the defaults. It will fail.
Actual results:
OCP 4.14 installation fails fatally.
Expected results:
OCP 4.14 installation succeeds just like OCP 4.13 installation. Alternatively, if the defaults have changed, documentation of the change is provided.
Additional info:
1. Related 4.14 feature: https://docs.openshift.com/container-platform/4.14/release_notes/ocp-4-14-release-notes.html#ocp-4-14-aws-s3-deletion, which provides the ability to skip deletion of S3 buckets altogether.
2. OCP logs are attached.
3. Strategic enterprise customers of managed services use data governance policies that enforce versioning, bucket policies, and so on, and these customers are blocked from installing.
Tracker issue for bootimage bump in 4.15. This issue should block issues which need a bootimage bump to fix.
The previous bump was OCPBUGS-22757.
This is a clone of issue OCPBUGS-25898. The following is the description of the original issue:
—
Description of problem:
PipelineRun logs page navigation is broken when navigating through the tasks on the PipelineRun Logs tab.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Navigate to the PipelineRun details page and select the Logs tab.
2. Navigate through the tasks of the PipelineRun.
Actual results:
- The Details tab becomes active when any task is selected.
- The Logs page is empty when the Logs tab is selected again.
- The last task is not selected for completed PipelineRuns.
Expected results:
- The Logs tab should remain active when the user is on the Logs tab.
- The last task should be selected for completed PipelineRuns.
Additional info:
This is a regression introduced by a change to the tab-selection logic in the HorizontalNav component.
Video: https://drive.google.com/file/d/15fx9GWO2dRh4uaibRmZ4VTk4HFxQ7NId/view?usp=sharing
Please review the following PR: https://github.com/openshift/cluster-kube-controller-manager-operator/pull/747
This is a clone of issue OCPBUGS-25760. The following is the description of the original issue:
—
Description of problem:
During live OVN migration, the network operator shows the error message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Create a 4.15 nightly SDN ROSA cluster.
2. oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation
3. oc edit featuregate cluster to enable featuregates.
4. Wait for all nodes to reboot and return to normal.
5. oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation[weliang@weliang ~]$ oc edit featuregate cluster[weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/network-type-migration":""}},"spec":{"networkType":"OVNKubernetes"}}'network.config.openshift.io/cluster patched[weliang@weliang ~]$ [weliang@weliang ~]$ oc get co networkNAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGEnetwork 4.15.0-0.nightly-2023-12-18-220750 True False True 105m Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change.[weliang@weliang ~]$ oc describe Network.config.openshift.io clusterName: clusterNamespace: Labels: <none>Annotations: network.openshift.io/network-type-migration: API Version: config.openshift.io/v1Kind: NetworkMetadata: Creation Timestamp: 2023-12-20T15:13:39Z Generation: 3 Resource Version: 119899 UID: 6a621b88-ac4f-4918-a7f6-98dba7df222cSpec: Cluster Network: Cidr: 10.128.0.0/14 Host Prefix: 23 External IP: Policy: Network Type: OVNKubernetes Service Network: 172.30.0.0/16Status: Cluster Network: Cidr: 10.128.0.0/14 Host Prefix: 23 Cluster Network MTU: 8951 Network Type: OpenShiftSDN Service Network: 172.30.0.0/16Events: <none>[weliang@weliang ~]$ oc describe Network.operator.openshift.io clusterName: clusterNamespace: Labels: <none>Annotations: <none>API Version: operator.openshift.io/v1Kind: NetworkMetadata: Creation Timestamp: 2023-12-20T15:15:37Z Generation: 275 Resource Version: 120026 UID: 278bd491-ac88-4038-887f-d1defc450740Spec: Cluster Network: Cidr: 10.128.0.0/14 Host Prefix: 23 Default Network: Openshift SDN Config: Enable Unidling: true Mode: NetworkPolicy Mtu: 8951 Vxlan Port: 4789 Type: OVNKubernetes Deploy Kube Proxy: false Disable Multi Network: false 
Disable Network Diagnostics: false Kube Proxy Config: Bind Address: 0.0.0.0 Log Level: Normal Management State: Managed Observed Config: <nil> Operator Log Level: Normal Service Network: 172.30.0.0/16 Unsupported Config Overrides: <nil> Use Multi Network Policy: falseStatus: Conditions: Last Transition Time: 2023-12-20T15:15:37Z Status: False Type: ManagementStateDegraded Last Transition Time: 2023-12-20T16:58:58Z Message: Not applying unsafe configuration change: invalid configuration: [cannot change default network type when not doing migration]. Use 'oc edit network.operator.openshift.io cluster' to undo the change. Reason: InvalidOperatorConfig Status: True Type: Degraded Last Transition Time: 2023-12-20T15:15:37Z Status: True Type: Upgradeable Last Transition Time: 2023-12-20T16:52:11Z Status: False Type: Progressing Last Transition Time: 2023-12-20T15:15:45Z Status: True Type: Available Ready Replicas: 0 Version: 4.15.0-0.nightly-2023-12-18-220750Events: <none>[weliang@weliang ~]$ oc get clusterversionNAME VERSION AVAILABLE PROGRESSING SINCE STATUSversion 4.15.0-0.nightly-2023-12-18-220750 True False 84m Error while reconciling 4.15.0-0.nightly-2023-12-18-220750: the cluster operator network is degraded[weliang@weliang ~]$
Expected results:
The migration succeeds.
Additional info:
The same error message appears on both ROSA and GCP clusters.
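The merge patch passed to oc patch in the reproduction steps is a small JSON document that sets an annotation and the target network type. The following is a minimal sketch, not part of the original report, that builds the same payload programmatically so that the annotation key and network type can be varied. The function name build_network_type_patch is hypothetical.

```python
import json


def build_network_type_patch(network_type: str, annotation_key: str) -> str:
    """Build a compact JSON merge patch that sets an empty annotation and spec.networkType."""
    patch = {
        "metadata": {"annotations": {annotation_key: ""}},
        "spec": {"networkType": network_type},
    }
    # Compact separators reproduce the single-line form used on the command line.
    return json.dumps(patch, separators=(",", ":"))


# Values copied from the oc patch command in the reproduction steps.
print(build_network_type_patch("OVNKubernetes", "network.openshift.io/network-type-migration"))
```

The printed string can be passed as the --patch argument of the oc patch command shown in the steps.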
Description of problem:
Trying to create a second cluster using the same cluster name and base domain as the first cluster fails, as expected, because of the DNS record-set conflicts. But deleting the second cluster makes the first cluster inaccessible, which is unexpected.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-14-100410
How reproducible:
Always
Steps to Reproduce:
1. Create the first cluster and make sure it succeeds.
2. Try to create the second cluster with the same cluster name, base domain, and region, and make sure it fails.
3. Destroy the second cluster, which failed due to "Platform Provisioning Check".
4. Check whether the first cluster is still healthy.
Actual results:
The first cluster becomes unhealthy, because its DNS record-sets are deleted by step 3.
Expected results:
The DNS record-sets of the first cluster stay untouched during step 3, and the first cluster stays healthy after step 3.
Additional info:
(1) the first cluster is by Flexy-install job https://mastern-jenkins-csb-openshift-qe.apps.ocp-c1.prod.psi.redhat.com/job/ocp-common/job/Flexy-install/257549/, and it's healthy initially $ oc get clusterversion NAME VERSION AVAILABLE PROGRESSING SINCE STATUS version 4.15.0-0.nightly-2024-01-14-100410 True False 54m Cluster version is 4.15.0-0.nightly-2024-01-14-100410 $ oc get nodes NAME STATUS ROLES AGE VERSION jiwei-0115y-lgns8-master-0.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-1.c.openshift-qe.internal Ready control-plane,master 73m v1.28.5+c84a6b8 jiwei-0115y-lgns8-master-2.c.openshift-qe.internal Ready control-plane,master 74m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-a-gqq96.c.openshift-qe.internal Ready worker 62m v1.28.5+c84a6b8 jiwei-0115y-lgns8-worker-b-2h9xd.c.openshift-qe.internal Ready worker 63m v1.28.5+c84a6b8 $ (2) try to create the second cluster and expect failing due to dns record already exists $ openshift-install version openshift-install 4.15.0-0.nightly-2024-01-14-100410 built from commit b6f320ab7eeb491b2ef333a16643c140239de0e5 release image registry.ci.openshift.org/ocp/release@sha256:385d84c803c776b44ce77b80f132c1b6ed10bd590f868c97e3e63993b811cc2d release architecture amd64 $ mkdir test1 $ cp install-config.yaml test1 $ yq-3.3.0 r test1/install-config.yaml baseDomain qe.gcp.devcluster.openshift.com $ yq-3.3.0 r test1/install-config.yaml metadata creationTimestamp: null name: jiwei-0115y $ yq-3.3.0 r test1/install-config.yaml platform gcp: projectID: openshift-qe region: us-central1 $ openshift-install create cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Consuming Install Config from target directory FATAL failed to fetch Terraform Variables: failed to fetch dependency of "Terraform Variables": failed to generate asset "Platform Provisioning Check": metadata.name: Invalid value: "jiwei-0115y": record(s) 
["api.jiwei-0115y.qe.gcp.devcluster.openshift.com."] already exists in DNS Zone (openshift-qe/qe) and might be in use by another cluster, please remove it to continue $ (3) delete the second cluster $ openshift-install destroy cluster --dir test1 INFO Credentials loaded from file "/home/fedora/.gcp/osServiceAccount.json" INFO Deleted 2 recordset(s) in zone qe INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone WARNING Skipping deletion of DNS Zone jiwei-0115y-lgns8-private-zone, not created by installer INFO Time elapsed: 37s INFO Uninstallation complete! $ (4) check the first cluster status and the dns record-sets $ oc get clusterversion Unable to connect to the server: dial tcp: lookup api.jiwei-0115y.qe.gcp.devcluster.openshift.com on 10.11.5.160:53: no such host $ $ gcloud dns managed-zones describe jiwei-0115y-lgns8-private-zone cloudLoggingConfig: kind: dns#managedZoneCloudLoggingConfig creationTime: '2024-01-15T07:22:55.199Z' description: Created By OpenShift Installer dnsName: jiwei-0115y.qe.gcp.devcluster.openshift.com. id: '9193862213315831261' kind: dns#managedZone labels: kubernetes-io-cluster-jiwei-0115y-lgns8: owned name: jiwei-0115y-lgns8-private-zone nameServers: - ns-gcp-private.googledomains.com. privateVisibilityConfig: kind: dns#managedZonePrivateVisibilityConfig networks: - kind: dns#managedZonePrivateVisibilityConfigNetwork networkUrl: https://www.googleapis.com/compute/v1/projects/openshift-qe/global/networks/jiwei-0115y-lgns8-network visibility: private $ gcloud dns record-sets list --zone jiwei-0115y-lgns8-private-zone NAME TYPE TTL DATA jiwei-0115y.qe.gcp.devcluster.openshift.com. NS 21600 ns-gcp-private.googledomains.com. jiwei-0115y.qe.gcp.devcluster.openshift.com. SOA 21600 ns-gcp-private.googledomains.com. cloud-dns-hostmaster.google.com. 1 21600 3600 259200 300 $ gcloud dns record-sets list --zone qe --filter='name~jiwei-0115y' Listed 0 items. $
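The destroy output above reports record-set deletions even though the second cluster never finished provisioning. The following is a minimal sketch, not part of the original report, that summarizes such deletions from installer output so they can be flagged in automation. It relies only on the "Deleted N recordset(s) in zone Z" message format shown above; the function name is hypothetical.

```python
import re
from typing import Dict

# Matches installer messages such as: INFO Deleted 2 recordset(s) in zone qe
_DELETED = re.compile(r"Deleted (\d+) recordset\(s\) in zone (\S+)")


def deleted_recordsets_by_zone(log_text: str) -> Dict[str, int]:
    """Sum the number of deleted record-sets per DNS zone reported in the log text."""
    totals: Dict[str, int] = {}
    for count, zone in _DELETED.findall(log_text):
        totals[zone] = totals.get(zone, 0) + int(count)
    return totals


# Messages copied from the destroy output above.
log = (
    "INFO Deleted 2 recordset(s) in zone qe "
    "INFO Deleted 3 recordset(s) in zone jiwei-0115y-lgns8-private-zone"
)
print(deleted_recordsets_by_zone(log))
```

A non-empty result for a cluster that failed at "Platform Provisioning Check" would be a signal that records belonging to another cluster were removed.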
Please review the following PR: https://github.com/openshift/cluster-api-operator/pull/24
This is a clone of issue OCPBUGS-26492. The following is the description of the original issue:
—
Description of problem:
Operation cannot be fulfilled on networks.operator.openshift.io during OVN live migration
Version-Release number of selected component (if applicable):
How reproducible:
Not always
Steps to Reproduce:
1. Enable the egressfirewall, externalIP, multicast, multus, network-policy, and service-idle features.
2. Start migrating the cluster from SDN to OVN.
Actual results:
[weliang@weliang ~]$ oc delete validatingwebhookconfigurations.admissionregistration.k8s.io/sre-techpreviewnoupgrade-validation validatingwebhookconfiguration.admissionregistration.k8s.io "sre-techpreviewnoupgrade-validation" deleted [weliang@weliang ~]$ oc edit featuregate cluster featuregate.config.openshift.io/cluster edited [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-20-154.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-45-93.ec2.internal Ready worker 80m v1.28.5+9605db4 ip-10-0-49-245.ec2.internal Ready worker 74m v1.28.5+9605db4 ip-10-0-57-37.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-60-0.ec2.internal Ready infra,worker 60m v1.28.5+9605db4 ip-10-0-62-121.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 ip-10-0-62-56.ec2.internal Ready control-plane,master 86m v1.28.5+9605db4 [weliang@weliang ~]$ for f in $(oc get nodes -o jsonpath='{.items[*].metadata.name}') ; do oc debug node/"${f}" -- chroot /host cat /etc/kubernetes/kubelet.conf | grep NetworkLiveMigration ; done Starting pod/ip-10-0-20-154ec2internal-debug-9wvd8 ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-45-93ec2internal-debug-rwvls ... To use host binaries, run `chroot /host` "NetworkLiveMigration": true,Removing debug pod ... Starting pod/ip-10-0-49-245ec2internal-debug-rp9dt ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-57-37ec2internal-debug-q5thk ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-60-0ec2internal-debug-zp78h ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, Starting pod/ip-10-0-62-121ec2internal-debug-42k2g ... To use host binaries, run `chroot /host`Removing debug pod ... 
"NetworkLiveMigration": true, Starting pod/ip-10-0-62-56ec2internal-debug-s99ls ... To use host binaries, run `chroot /host`Removing debug pod ... "NetworkLiveMigration": true, [weliang@weliang ~]$ oc patch Network.config.openshift.io cluster --type='merge' --patch '{"metadata":{"annotations":{"network.openshift.io/live-migration":""}},"spec":{"networkType":"OVNKubernetes"}}' network.config.openshift.io/cluster patched [weliang@weliang ~]$ [weliang@weliang ~]$ oc get co network NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE network 4.15.0-0.nightly-2024-01-06-062415 True False True 4h1m Internal error while updating operator configuration: could not apply (/, Kind=) /cluster, err: failed to apply / update (operator.openshift.io/v1, Kind=Network) /cluster: Operation cannot be fulfilled on networks.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again [weliang@weliang ~]$ oc get node NAME STATUS ROLES AGE VERSION ip-10-0-2-52.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-26-16.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-32-116.ec2.internal Ready worker 3h54m v1.28.5+9605db4 ip-10-0-32-67.ec2.internal Ready infra,worker 3h38m v1.28.5+9605db4 ip-10-0-35-11.ec2.internal Ready infra,worker 3h39m v1.28.5+9605db4 ip-10-0-39-125.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 ip-10-0-6-117.ec2.internal Ready control-plane,master 4h2m v1.28.5+9605db4 [weliang@weliang ~]$ oc get Network.operator.openshift.io/cluster -o json { "apiVersion": "operator.openshift.io/v1", "kind": "Network", "metadata": { "creationTimestamp": "2024-01-08T13:28:07Z", "generation": 417, "name": "cluster", "resourceVersion": "236888", "uid": "37fb36f0-c13c-476d-aea1-6ebc1c87abe8" }, "spec": { "clusterNetwork": [ { "cidr": "10.128.0.0/14", "hostPrefix": 23 } ], "defaultNetwork": { "openshiftSDNConfig": { "enableUnidling": true, "mode": "NetworkPolicy", "mtu": 8951, 
"vxlanPort": 4789 }, "ovnKubernetesConfig": { "egressIPConfig": {}, "gatewayConfig": { "ipv4": {}, "ipv6": {}, "routingViaHost": false }, "genevePort": 6081, "mtu": 8901, "policyAuditConfig": { "destination": "null", "maxFileSize": 50, "maxLogFiles": 5, "rateLimit": 20, "syslogFacility": "local0" } }, "type": "OVNKubernetes" }, "deployKubeProxy": false, "disableMultiNetwork": false, "disableNetworkDiagnostics": false, "kubeProxyConfig": { "bindAddress": "0.0.0.0" }, "logLevel": "Normal", "managementState": "Managed", "migration": { "mode": "Live", "networkType": "OVNKubernetes" }, "observedConfig": null, "operatorLogLevel": "Normal", "serviceNetwork": [ "172.30.0.0/16" ], "unsupportedConfigOverrides": null, "useMultiNetworkPolicy": false }, "status": { "conditions": [ { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "False", "type": "ManagementStateDegraded" }, { "lastTransitionTime": "2024-01-08T17:29:52Z", "status": "False", "type": "Degraded" }, { "lastTransitionTime": "2024-01-08T13:28:07Z", "status": "True", "type": "Upgradeable" }, { "lastTransitionTime": "2024-01-08T17:26:38Z", "status": "False", "type": "Progressing" }, { "lastTransitionTime": "2024-01-08T13:28:20Z", "status": "True", "type": "Available" } ], "readyReplicas": 0, "version": "4.15.0-0.nightly-2024-01-06-062415" } } [weliang@weliang ~]$
Expected results:
OVN live migration passes.
Additional info:
must-gather: https://people.redhat.com/~weliang/must-gather1.tar.gz
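The message "the object has been modified; please apply your changes to the latest version and try again" is the API server's optimistic-concurrency conflict: an update carried a stale resourceVersion. The following is a minimal, self-contained sketch of the usual re-read-and-retry pattern. It is an illustration only, not the network operator's actual code; FakeStore, ConflictError, and update_with_retry are hypothetical stand-ins for a real API client.

```python
import copy


class ConflictError(Exception):
    """Raised when an update carries a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for an API server holding a single object."""

    def __init__(self, obj):
        self._obj = copy.deepcopy(obj)

    def get(self):
        return copy.deepcopy(self._obj)

    def update(self, obj):
        # Reject writes that were based on an older version of the object.
        if obj["metadata"]["resourceVersion"] != self._obj["metadata"]["resourceVersion"]:
            raise ConflictError("the object has been modified")
        new = copy.deepcopy(obj)
        new["metadata"]["resourceVersion"] = str(int(obj["metadata"]["resourceVersion"]) + 1)
        self._obj = new
        return copy.deepcopy(new)


def update_with_retry(store, mutate, attempts=5):
    """Re-read the latest object, re-apply the change, and retry on conflict."""
    for _ in range(attempts):
        current = store.get()
        mutate(current)
        try:
            return store.update(current)
        except ConflictError:
            continue  # someone else wrote first; fetch the latest version and try again
    raise ConflictError(f"gave up after {attempts} attempts")
```

The key design point is that the change is re-applied to a freshly read object on every attempt, so concurrent writes by other controllers are preserved rather than overwritten.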
Description of problem:
An operator installPlan has duplicate key values for installPlan?.spec.clusterServiceVersionNames which is displayed in multiple pages in the management console.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-07-31-181848
How reproducible:
Always
Expected results:
In the screenshots linked below, the clusterServiceVersionNames value should display only one item, but because there are duplicate key values it is listed twice.
Additional info:
This bug causes duplicate values to be shown on several pages of the Management Console. Screenshots:
https://drive.google.com/file/d/1OwiLXU8iETNusCf6N2AhB5y-ykXwgyBU/view?usp=drive_link
https://drive.google.com/file/d/1qfMso1x-s--samU7OmDKU-3NVfxqsxWD/view?usp=drive_link
https://drive.google.com/file/d/1Z9mGRllp4ZLN2OlSNKZY2QTIDx8QpyVS/view?usp=drive_link
https://drive.google.com/file/d/1CYWMpKy_KmUV_KfIxCjS1FAWHYbYA6rw/view?usp=drive_link
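One way to avoid listing the same name twice is to de-duplicate the clusterServiceVersionNames list before display while keeping its original order. The following is a minimal sketch of that idea, not the console's actual fix; the function name and the sample operator names are hypothetical.

```python
from typing import Iterable, List


def unique_in_order(names: Iterable[str]) -> List[str]:
    """Remove duplicate names while preserving first-seen order."""
    # dict preserves insertion order, so its keys are the de-duplicated names in order.
    return list(dict.fromkeys(names))


# Hypothetical installPlan.spec.clusterServiceVersionNames value containing a duplicate.
csv_names = ["example-operator.v1.0.0", "example-operator.v1.0.0", "other-operator.v2.1.0"]
print(unique_in_order(csv_names))
```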
This is a clone of issue OCPBUGS-24526. The following is the description of the original issue:
—
Description of problem:
Snapshots taken to gather deprecation information from bundles are from the Subscription namespace instead of the CatalogSource namespace. That means that if the Subscription is in a different namespace then no bundles will be present in the snapshot.
How reproducible:
100%
Steps to Reproduce:
1. Create a CatalogSource with olm.deprecation entries.
2. In a different namespace, create a Subscription targeting a package with deprecations.
Actual results:
No Deprecation Conditions will be present.
Expected results:
Deprecation Conditions should be present.
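The difference between the two lookups can be sketched as follows. This is an illustration only, not OLM's actual code: it models a Subscription as a plain dictionary and uses its spec.sourceNamespace field (the namespace of the CatalogSource) versus metadata.namespace. The function name, the bundle name, and the namespaces in the sample data are hypothetical.

```python
from typing import Dict, List


def bundles_for_subscription(
    subscription: dict,
    bundles_by_namespace: Dict[str, List[str]],
    use_catalog_namespace: bool = True,
) -> List[str]:
    """Return the bundles visible in the snapshot for the chosen namespace."""
    if use_catalog_namespace:
        # Expected behavior: look where the CatalogSource lives.
        namespace = subscription["spec"]["sourceNamespace"]
    else:
        # Behavior described in this issue: look where the Subscription lives.
        namespace = subscription["metadata"]["namespace"]
    return bundles_by_namespace.get(namespace, [])


subscription = {
    "metadata": {"name": "example-sub", "namespace": "team-a"},
    "spec": {"source": "example-catalog", "sourceNamespace": "openshift-marketplace"},
}
snapshots = {"openshift-marketplace": ["example-operator.v1.0.0"]}

print(bundles_for_subscription(subscription, snapshots, use_catalog_namespace=False))
print(bundles_for_subscription(subscription, snapshots, use_catalog_namespace=True))
```

With the Subscription namespace the snapshot is empty, so no deprecation information can be found; with the CatalogSource namespace the bundle is present.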
This is a clone of issue OCPBUGS-18577. The following is the description of the original issue:
—
Description of problem:
Must-gather link
long snippet from e2e log
external internet 09/01/23 07:26:09.624 Sep 1 07:26:09.624: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 http://www.google.com:80' STEP: creating an egressfirewall object 09/01/23 07:26:09.903 STEP: calling oc create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml 09/01/23 07:26:09.903 Sep 1 07:26:09.904: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/root/.kube/config create -f /tmp/fixture-testdata-dir978363556/test/extended/testdata/egress-firewall/ovnk-egressfirewall-test.yaml' egressfirewall.k8s.ovn.org/default createdSTEP: sending traffic to control plane nodes should work 09/01/23 07:26:22.122 Sep 1 07:26:22.130: INFO: Running 'oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443' Sep 1 07:26:23.358: INFO: Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28[AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from namespace "e2e-test-egress-firewall-e2e-2vvzx". 09/01/23 07:26:23.358 STEP: Found 4 events. 
09/01/23 07:26:23.361 Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {multus } AddedInterface: Add eth0 [10.131.0.89/23] from ovn-kubernetes Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Pulled: Container image "quay.io/openshift/community-e2e-images:e2e-quay-io-redhat-developer-nfs-server-1-1-dlXGfzrk5aNo8EjC" already present on machine Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Created: Created container egressfirewall-container Sep 1 07:26:23.361: INFO: At 2023-09-01 07:26:08 -0400 EDT - event for egressfirewall: {kubelet lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com} Started: Started container egressfirewall-container Sep 1 07:26:23.363: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.363: INFO: egressfirewall lon06-worker-0.rdr-qe-ocp-upi-7250.redhat.com Running [{Initialized True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT } {Ready True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:09 -0400 EDT } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2023-09-01 07:26:07 -0400 EDT }] Sep 1 07:26:23.363: INFO: Sep 1 07:26:23.367: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.383: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-egress-firewall-e2e-2vvzx-user}, err: <nil> Sep 1 07:26:23.398: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-egress-firewall-e2e-2vvzx}, err: <nil> Sep 1 07:26:23.414: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~X_2HPGEj3O9hpd-3XKTckrp9bO23s_7zlJ3Tkn7ncBE}, err: <nil> [AfterEach] [sig-network][Feature:EgressFirewall] github.com/openshift/origin/test/extended/util/client.go:180 STEP: Collecting events from 
namespace "e2e-test-no-egress-firewall-e2e-84f48". 09/01/23 07:26:23.414 STEP: Found 0 events. 09/01/23 07:26:23.416 Sep 1 07:26:23.417: INFO: POD NODE PHASE GRACE CONDITIONS Sep 1 07:26:23.417: INFO: Sep 1 07:26:23.421: INFO: skipping dumping cluster info - cluster too large Sep 1 07:26:23.446: INFO: Deleted {user.openshift.io/v1, Resource=users e2e-test-no-egress-firewall-e2e-84f48-user}, err: <nil> Sep 1 07:26:23.451: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthclients e2e-client-e2e-test-no-egress-firewall-e2e-84f48}, err: <nil> Sep 1 07:26:23.457: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~2Lk8-jWfwpdyo59E9YF7kQFKH2LBUSvnbJdKj7rOzn4}, err: <nil> [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-no-egress-firewall-e2e-84f48" for this suite. 09/01/23 07:26:23.457 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] dump namespaces | framework.go:196 STEP: dump namespace information after failure 09/01/23 07:26:23.462 [DeferCleanup (Each)] [sig-network][Feature:EgressFirewall] tear down framework | framework.go:193 STEP: Destroying namespace "e2e-test-egress-firewall-e2e-2vvzx" for this suite. 
09/01/23 07:26:23.463 fail [github.com/openshift/origin/test/extended/networking/egress_firewall.go:155]: Unexpected error: <*fmt.wrapError | 0xc001dd50a0>: { msg: "Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443:\nStdOut>\ncommand terminated with exit code 28\nStdErr>\ncommand terminated with exit code 28\nexit status 28\n", err: <*exec.ExitError | 0xc001dd5080>{ ProcessState: { pid: 140483, status: 7168, rusage: { Utime: {Sec: 0, Usec: 149480}, Stime: {Sec: 0, Usec: 19930}, Maxrss: 222592, Ixrss: 0, Idrss: 0, Isrss: 0, Minflt: 1536, Majflt: 0, Nswap: 0, Inblock: 0, Oublock: 0, Msgsnd: 0, Msgrcv: 0, Nsignals: 0, Nvcsw: 596, Nivcsw: 173, }, }, Stderr: nil, }, } Error running /usr/local/bin/oc --namespace=e2e-test-egress-firewall-e2e-2vvzx --kubeconfig=/tmp/configfile1049873803 exec egressfirewall -- curl -q -s -I -m1 -k https://193.168.200.248:6443: StdOut> command terminated with exit code 28 StdErr> command terminated with exit code 28 exit status 28 occurred Ginkgo exit error 1: exit with code 1failed: (18.7s) 2023-09-01T11:26:23 "[sig-network][Feature:EgressFirewall] when using openshift ovn-kubernetes should ensure egressfirewall is created [Suite:openshift/conformance/parallel]"
Version-Release number of selected component (if applicable):
4.13.11
How reproducible:
This e2e failure is not consistently reproducible.
Steps to Reproduce:
1. Start a Z stream Job via Jenkins
2. Monitor e2e
Actual results:
The e2e test fails.
Expected results:
e2e should pass
Additional info:
Please review the following PR: https://github.com/openshift/installer/pull/7817
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/installer/pull/6770 reverted part of https://github.com/openshift/installer/pull/5788, which had set guestinfo.domain for the bootstrap machine. This breaks some OKD installations that require that setting.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
GCP preemptible VM termination is not being handled correctly by machine-api-termination-handler.
Version-Release number of selected component (if applicable):
Tested on both 4.10.22 and 4.11.2
How reproducible:
To reproduce the issue: create a spot instance machine in GCP, then stop the instance. The machine-api-termination-handler pod shows no signal indicating that the instance was terminated, although the machines list does show the TERMINATED status. As a result, pods are not gracefully moved off in the 90-second window before the node is turned off. We would expect a terminated node to wait for pods to move off (up to 90 seconds) and then shut down, instead of shutting down immediately.
Steps to Reproduce:
1. Create spot instance machine in gcp.
2. Stop instance
3. Notice in machine-api-termination-handler pod there is no signal in there signifying it was terminated.
4. Note we do see on machines list the TERMINATED status.
5. Result is that pods are not gracefully moved off in the 90sec window before node is turned off.
Actual results:
The machine-api-termination-handler logs don't show any message such as "Instance marked for termination, marking Node for deletion" but instead no signal is received from GCP.
Expected results:
A terminated node should wait for pods to move off (up to 90sec) and then shut down, instead of shutting down immediately.
Additional info:
Here is the code:
https://github.com/openshift/machine-api-provider-gcp/blob/main/pkg/termination/termination.go#L96-L127
#forum-cloud slack thread:
https://coreos.slack.com/archives/CBZHF4DHC/p1656524730323259
#forum-node slack thread:
https://coreos.slack.com/archives/CK1AE4ZCK/p1656619821630479
Description of problem:
In 4.14, the MCO became the default provider of image registry certificates. However, all of these certs are put onto disk and into config in the cluster. We need a way for components like hypershift to provide the certificates they need to run properly during their bootstrap process.
Version-Release number of selected component (if applicable):
How reproducible:
always with hypershift
Steps to Reproduce:
1. Bootstrap a hypershift cluster
2. Bootstrap fails due to image pull errors
Actual results:
failure due to lack of IR certs
Expected results:
IR certs are provided by the component that needs them via a cmd flag, and bootstrap succeeds.
Additional info:
Description of problem:
Bootstrap process failed due to coredns.yaml manifest generation issue:
Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: I0204 05:14:34.966343 1 bootstrap.go:188] manifests/on-prem/coredns.yaml
Feb 04 05:14:34 yunjiang-p2-2r2b2-bootstrap bootkube.sh[11219]: F0204 05:14:34.966513 1 bootstrap.go:188] error rendering bootstrap manifests: failed to execute template: template: manifests/on-prem/coredns.yaml:34:32: executing "manifests/on-prem/coredns.yaml" at <onPremPlatformAPIServerInternalIPs .ControllerConfig>: error calling onPremPlatformAPIServerInternalIPs: invalid platform for API Server Internal IP
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=255/EXCEPTION
Feb 04 05:14:35 yunjiang-p2-2r2b2-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-02-03-192446 4.16.0-0.nightly-2024-02-03-221256
How reproducible:
Always
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
Actual results:
coredns.yaml cannot be generated, and bootstrap fails.
Expected results:
Bootstrap process succeeds.
Additional info:
Description of problem:
From time to time the installation fails with something like the one below:
2022-01-03 16:33:27.936 | level=debug msg=Generating Terraform Variables...
2022-01-03 16:33:27.940 | level=info msg=Obtaining RHCOS image file from 'https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases/rhcos-4.8/48.84.202109241901-0/x86_64/rhcos-48.84.202109241901-0-openstack.x86_64.qcow2.gz?sha256=e0a1d8a99c5869150a56b8de475ea7952ca2fa3aacad7ca48533d1176df503ab'
2022-01-03 16:33:27.943 | level=fatal msg=failed to fetch Terraform Variables: failed to generate asset "Terraform Variables": failed to get openstack Terraform variables: Get "https://rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com/art/storage/releases/rhcos-4.8/48.84.202109241901-0/x86_64/rhcos-48.84.202109241901-0-openstack.x86_64.qcow2.gz": dial tcp: lookup rhcos-redirector.apps.art.xq1c.p1.openshiftapps.com on 10.46.0.31:53: read udp 172.16.40.23:38673->10.46.0.31:53: i/o timeout
2022-01-03 16:33:27.946 |
Version:
4.8.0-0.nightly-2021-12-23-010813 but we see it for other versions as well
IPI
I expect the installer to have some sort of retry mechanism.
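The retry mechanism requested above could look roughly like the following. This is only a sketch of retrying with exponential backoff, not the installer's actual behavior; the function name `retry_with_backoff`, the attempt count, and the delays are hypothetical choices made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or the attempts are exhausted.

    attempts and base_delay are illustrative defaults, not installer settings.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            # In Python, DNS lookup failures, timeouts, and refused
            # connections are all OSError subclasses.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A transient failure such as the DNS i/o timeout in the log above would then be retried a few times before the error is treated as fatal.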
In order to evaluate solutions for https://issues.redhat.com/browse/RFE-3953 we need to investigate the root cause of the issue
If there is an issue, have a strategy to display external labels on alerts
Description of problem:
Agent-based Installer fails to deploy a HA cluster (3x masters, 2x workers) with OKD/FCOS when the network DNS server does not resolve the api-int.* endpoint. The latter is not required for HA deployments and is actually never mentioned in OCP docs for Agent-based Installer. OCP is not affected at all.
Version-Release number of selected component (if applicable):
4.13 4.14 4.15
Description of problem:
Use centos stream to build libvirt images
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
time="2024-01-04T05:30:45-05:00" level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to generate asset \"Platform Provisioning Check\": platform.vsphere: Internal error: vCenter is failing to retrieve config product version information for the ESXi host: "
Description of problem:
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.20   True        False         43h     Cluster version is 4.11.20
$ oc get clusterrolebinding system:openshift:controller:service-serving-cert-controller -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2023-01-11T13:19:24Z"
  name: system:openshift:controller:service-serving-cert-controller
  resourceVersion: "11410"
  uid: 8b3e8c56-9f25-4f89-9159-5300585cc129
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:controller:service-serving-cert-controller
subjects:
- kind: ServiceAccount
  name: service-serving-cert-controller
  namespace: openshift-infra
$ oc get sa service-serving-cert-controller -n openshift-infra
Error from server (NotFound): serviceaccounts "service-serving-cert-controller" not found
The serviceAccount service-serving-cert-controller does not exist, neither in openshift-infra nor in any other namespace. It is therefore not clear what this ClusterRoleBinding does, what use case it fulfills, and why it references a non-existent serviceAccount. From a security point of view, it is recommended to remove non-existent serviceAccounts from ClusterRoleBindings, as a potential attacker could abuse the current state by creating the necessary serviceAccount and gaining undesired permissions.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4 (all version from what we have found)
How reproducible:
Always
Steps to Reproduce:
1. Install OpenShift Container Platform 4
2. Run oc get clusterrolebinding system:openshift:controller:service-serving-cert-controller -o yaml
Actual results:
$ oc get clusterrolebinding system:openshift:controller:service-serving-cert-controller -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  annotations:
    rbac.authorization.kubernetes.io/autoupdate: "true"
  creationTimestamp: "2023-01-11T13:19:24Z"
  name: system:openshift:controller:service-serving-cert-controller
  resourceVersion: "11410"
  uid: 8b3e8c56-9f25-4f89-9159-5300585cc129
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:controller:service-serving-cert-controller
subjects:
- kind: ServiceAccount
  name: service-serving-cert-controller
  namespace: openshift-infra
$ oc get sa service-serving-cert-controller -n openshift-infra
Error from server (NotFound): serviceaccounts "service-serving-cert-controller" not found
Expected results:
The serviceAccount called service-serving-cert-controller should exist; otherwise, the ClusterRoleBinding should be removed.
Additional info:
Finding related to a Security review done on the OpenShift Container Platform 4 - Platform
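A check for this kind of finding can be sketched as follows. It assumes the bindings were exported with `oc get clusterrolebinding -o json` and the service accounts with `oc get sa -A -o json`, and that both were already parsed into Python objects; the function name `dangling_service_accounts` is hypothetical and this is not part of any OpenShift tooling.

```python
from typing import Iterable


def dangling_service_accounts(
    bindings: Iterable[dict],
    existing: set[tuple[str, str]],
) -> list[tuple[str, str, str]]:
    """Return (binding, namespace, name) for ServiceAccount subjects that do not exist.

    bindings: the "items" list from the exported ClusterRoleBindings.
    existing: (namespace, name) pairs of the service accounts found in the cluster.
    """
    findings = []
    for binding in bindings:
        binding_name = binding.get("metadata", {}).get("name", "")
        # "subjects" may be absent or null on a binding.
        for subject in binding.get("subjects") or []:
            if subject.get("kind") != "ServiceAccount":
                continue
            key = (subject.get("namespace", ""), subject.get("name", ""))
            if key not in existing:
                findings.append((binding_name, *key))
    return findings
```

Run against the binding shown above, this would report service-serving-cert-controller in openshift-infra as a subject with no matching serviceAccount.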
In the python script used during bug pre-dispatch, we should align status and assignee between jira and github, keeping github as the source of truth:
this should happen after we add the ipv6 CI jobs
Description of problem:
The following binaries need to get extracted from the release payload for both rhel8 and rhel9:
oc
ccoctl
opm
openshift-install
oc-mirror
The images that contain these should produce artifacts of both kinds in some location, and probably make the artifact of their architecture available under a normal location in the path. Example:
/usr/share/<binary>.rhel8
/usr/share/<binary>.rhel9
/usr/bin/<binary>
This ticket is about getting "oc adm release extract" to do the right thing in a backwards-compatible way. If both binaries are available, get those. If not, get from the old location.
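The backwards-compatible lookup described here can be sketched as below. This is an illustration only, not the logic in "oc adm release extract"; the function name `paths_to_extract` is hypothetical, and the paths are the example locations from the description rather than confirmed final ones.

```python
import os
from typing import Callable


def paths_to_extract(
    binary: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> list[str]:
    """Pick which files to extract for one binary.

    If both the rhel8 and rhel9 artifacts are present, return both.
    Otherwise fall back to the old single location, if it is present.
    """
    versioned = [f"/usr/share/{binary}.rhel{version}" for version in (8, 9)]
    if all(exists(path) for path in versioned):
        return versioned
    legacy = f"/usr/bin/{binary}"
    return [legacy] if exists(legacy) else []
```

Passing `exists` as a parameter keeps the decision testable without a real image filesystem; older payloads that only ship /usr/bin/<binary> still resolve to that path.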
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Description of problem:
Cluster and userworkload alertmanager instances inadvertently become peered during upgrade
Version-Release number of selected component (if applicable):
How reproducible:
infrequently - customer observed this on 3 clusters out of 15
Steps to Reproduce:
Deploy userworkload monitoring
~~~
config.yaml: |
  enableUserWorkload: true
  prometheusK8s:
~~~
Deploy user workload alertmanager
~~~
name: user-workload-monitoring-config
namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      enabled: true
~~~
Upgrade the cluster
Verify the state of the alertmanager clusters:
~~~
$ oc exec -n openshift-monitoring alertmanager-main-0 -- amtool cluster show -o json --alertmanager.url=http://localhost:9093
~~~
Actual results:
Alertmanager shows 4 peers
Expected results:
There should be 2 separate pairs of peers, one per alertmanager cluster
Additional info:
Mitigation steps: Scaling down one of the alertmanager statefulsets to 0 and then scaling up again restores the expected configuration (i.e. 2 separate alertmanager clusters) - the customer then added networkpolicies to prevent alertmanager gossip between namespaces.
Please review the following PR: https://github.com/openshift/vertical-pod-autoscaler-operator/pull/149
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-22899. The following is the description of the original issue:
—
Description of problem:
In the self-managed HCP use case, if the on-premise baremetal management cluster does not have nodes labeled with the "topology.kubernetes.io/zone" key, then all HCP pods for a highly available cluster are scheduled to a single mgmt cluster node. This is a result of the way the affinity rules are constructed. Take the pod affinity/antiAffinity example below, which is generated for an HA HCP cluster. If the "topology.kubernetes.io/zone" label does not exist on the mgmt cluster nodes, then the pod will still get scheduled but that antiAffinity rule is effectively ignored. That seems odd due to the usage of the "requiredDuringSchedulingIgnoredDuringExecution" value, but I have tested this and the rule truly is ignored if the topologyKey is not present.
podAffinity:
  preferredDuringSchedulingIgnoredDuringExecution:
  - podAffinityTerm:
      labelSelector:
        matchLabels:
          hypershift.openshift.io/hosted-control-plane: clusters-vossel1
      topologyKey: kubernetes.io/hostname
    weight: 100
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: kube-apiserver
        hypershift.openshift.io/control-plane-component: kube-apiserver
    topologyKey: topology.kubernetes.io/zone
In the event that no "zones" are configured for the baremetal mgmt cluster, then the only other pod affinity rule is one that actually colocates the pods together. This results in a HA HCP having all the etcd, apiservers, etc... scheduled to a single node.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
100%
Steps to Reproduce:
1. Create a self-managed HA HCP cluster on a mgmt cluster with nodes that lack the "topology.kubernetes.io/zone" label
Actual results:
all HCP pods are scheduled to a single node.
Expected results:
HCP pods should always be spread across multiple nodes.
Additional info:
A way to address this is to add another anti-affinity rule which prevents every component from being scheduled on the same node as its replicas
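The additional rule suggested above might be generated along these lines. This is a sketch and not the HyperShift implementation: the function name `hostname_anti_affinity` is hypothetical, and whether the rule should be required or preferred is a design choice the report does not settle. The kubernetes.io/hostname topology key is used because that label is set on every node, unlike the zone label.

```python
def hostname_anti_affinity(match_labels: dict[str, str]) -> dict:
    """Build a podAntiAffinity stanza that keeps replicas on different nodes.

    match_labels should select the replicas of one component only, so that
    different components may still share a node.
    """
    return {
        "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
                {
                    "labelSelector": {"matchLabels": dict(match_labels)},
                    # Present on every node, so the rule cannot be silently
                    # skipped the way the zone-based rule is.
                    "topologyKey": "kubernetes.io/hostname",
                }
            ]
        }
    }
```

For the kube-apiserver example above, `hostname_anti_affinity({"app": "kube-apiserver"})` yields a term that would be added alongside the existing zone-based one.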
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. Try to deploy in mad02 or mad04 with powervs
2. Cannot import boot image
3. Fail
Actual results:
Fail
Expected results:
Cluster comes up
Additional info:
Please review the following PR: https://github.com/openshift/router/pull/513
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
In the Reliability test (a loaded long run with a stable load), the memory of the 3 multus pods increased from <100 MiB to 700+MB in 7 days. The multus pods have a memory request of 65Mi and no memory limit. If the test runs for a longer time and the memory keeps increasing, this issue can impact the nodes' resources.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-13-174800
How reproducible:
Met this the first time. I did not see this in 4.14's Reliability test.
Steps to Reproduce:
1. Install an AWS compact cluster with 3 masters; workers are on master nodes too.
2. Run reliability-v2 test https://github.com/openshift/svt/tree/master/reliability-v2. The test runs for a long time and simulates multiple customers' usage of the cluster. config: 1 admin, 5 dev-test, 5 dev-prod, 1 dev-cron.
3. Monitor the metrics: container_memory_rss{container="kube-multus",namespace="openshift-multus"}
Actual results:
3 multus pods memory increased from <100 MiB to 700+MB in 7 days. After the test load stopped, the memory increase stopped, but didn't drop down.
Expected results:
Memory should not increase continuously
Additional info:
% oc adm top pod -n openshift-multus --containers=true --sort-by memory -l app=multus
POD            NAME          CPU(cores)   MEMORY(bytes)
multus-xp474   kube-multus   12m          1275Mi
multus-xp474   POD           0m           0Mi
multus-xt64s   kube-multus   21m          971Mi
multus-xt64s   POD           0m           0Mi
multus-d9xcs   kube-multus   6m           757Mi
multus-d9xcs   POD           0m           0Mi
The monitoring screenshots:
multus-memory-increase-stop.png
Must-gather: must-gather.local.4628887688332215806.tar.gz
This is a duplicate of https://issues.redhat.com/browse/ART-8361, created here because `target` cannot be set on ART bugs.
Payload https://amd64.ocp.releases.ci.openshift.org/releasestream/4.15.0-0.nightly/release/4.15.0-0.nightly-2023-10-26-222533 failed on no successful runs of techpreview-serial. Looks like all failed on:
[sig-arch] events should not repeat pathologically for ns/openshift-dns
{ 2 events happened too frequently
event happened 22 times, something is wrong: ns/openshift-dns service/dns-default hmsg/6f6ed749fd - pathological/true reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (4 endpoints, 2 zones), addressType: IPv4 From: 23:11:05Z To: 23:11:06Z result=reject
event happened 23 times, something is wrong: ns/openshift-dns service/dns-default hmsg/6f6ed749fd - pathological/true reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (4 endpoints, 2 zones), addressType: IPv4 From: 23:11:06Z To: 23:11:07Z result=reject }
Please review the following PR: https://github.com/openshift/apiserver-network-proxy/pull/44
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
label missing for aws-ebs-csi-driver-operator in HCP Guest cluster
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-12-195514
How reproducible:
Always
Steps to Reproduce:
1. Install Hypershift kind cluster from flexy template aos-4_14/ipi-on-aws/versioned-installer-ovn-hypershift-ci
oc get deployment/aws-ebs-csi-driver-operator -n clusters-hypershift-ci-3366 -o jsonpath='{.spec.template.metadata.labels}'
{"name":"aws-ebs-csi-driver-operator"}
Actual results:
{"name":"aws-ebs-csi-driver-operator"}
Expected results:
The hypershift.openshift.io/need-management-kas-access label should be present
Additional info:
oc get deployment/cluster-storage-operator -n clusters-hypershift-ci-3366 -o jsonpath='{.spec.template.metadata.labels}'
{"hypershift.openshift.io/hosted-control-plane":"clusters-hypershift-ci-3366","hypershift.openshift.io/need-management-kas-access":"true","name":"cluster-storage-operator"}
Discussion: https://redhat-internal.slack.com/archives/GK0DA0JR5/p1694782231463969
Description of problem:
The configured accessTokenInactivityTimeout under tokenConfig in HostedCluster doesn't have any effect.
1. The value is not getting updated in the oauth-openshift configmap.
2. HostedCluster allows the user to set an accessTokenInactivityTimeout value < 300s, whereas in a master cluster the value should be > 300s.
Version-Release number of selected component (if applicable):
4.13
How reproducible:
Always
Steps to Reproduce:
1. Install a fresh 4.13 hypershift cluster
2. Configure accessTokenInactivityTimeout as below:
$ oc edit hc -n clusters
...
spec:
  configuration:
    oauth:
      identityProviders:
      ...
      tokenConfig:
        accessTokenInactivityTimeout: 100s
...
3. Check the hcp:
$ oc get hcp -oyaml
...
tokenConfig:
  accessTokenInactivityTimeout: 1m40s
...
4. Login to guest cluster with testuser-1 and get the token
$ oc login https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443 -u testuser-1 -p xxxxxxx
$ TOKEN=`oc whoami -t`
$ oc login --token="$TOKEN"
WARNING: Using insecure TLS client config. Setting this option is not supported!
Logged into "https://a8890bba21c9b48d4a05096eee8d4edd-738276775c71fb8f.elb.us-east-2.amazonaws.com:6443" as "testuser-1" using the token provided.
You don't have any projects. You can try to create a new project, by running
oc new-project <projectname>
Actual results:
1. hostedcluster will allow user to set the value < 300s for accessTokenInactivityTimeout which is not possible on master cluster.
2. The value is not updated in oauth-openshift configmap:
$ oc get cm oauth-openshift -oyaml -n clusters-hypershift-ci-25785
...
tokenConfig:
  accessTokenMaxAgeSeconds: 86400
  authorizeTokenMaxAgeSeconds: 300
...
3. Login doesn't fail even if the user is not active for more than the set accessTokenInactivityTimeout seconds.
Expected results:
Login fails if the user is not active within the accessTokenInactivityTimeout seconds.
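The missing lower-bound check could be sketched as follows. It only illustrates the validation the report asks for: the names `parse_duration` and `validate_inactivity_timeout` are hypothetical, the parser handles only whole-number h/m/s components (a subset of the duration syntax seen above, such as 100s and 1m40s), and the 300-second boundary is taken from the report.

```python
import re

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([hms])")
# Lower bound taken from the report; values below it should be rejected.
MIN_INACTIVITY_SECONDS = 300


def parse_duration(text: str) -> int:
    """Convert a duration such as '100s' or '1m40s' to seconds."""
    position = 0
    total = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            break  # unparsed characters between components
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return total


def validate_inactivity_timeout(text: str) -> int:
    """Return the timeout in seconds, rejecting values under the minimum."""
    seconds = parse_duration(text)
    if seconds < MIN_INACTIVITY_SECONDS:
        raise ValueError(
            f"accessTokenInactivityTimeout {text!r} is below {MIN_INACTIVITY_SECONDS}s"
        )
    return seconds
```

With this check, the 100s value from the reproduction steps (shown as 1m40s in the hcp) would be rejected instead of being accepted and then ignored.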
Description of problem:
The cloud-controller-manager operator can show garbage in its status:
# oc get co cloud-controller-manager
NAME                       VERSION                                    AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
cloud-controller-manager   4.14.0-0.nightly-arm64-2023-06-07-071657   True        False         True       58m     Failed to resync for operator: 4.14.0-0.nightly-arm64-2023-06-07-071657 because &{%!e(string=failed to apply resources because TrustedCABundleControllerControllerDegraded condition is set to True)}
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-arm64-2023-06-07-071657
How reproducible:
always
Steps to Reproduce:
1. oc delete project openshift-cloud-controller-manager
2. wait a couple of minutes
3. oc get co openshift-cloud-controller-manager
Actual results:
Failed to resync for operator: 4.14.0-0.nightly-arm64-2023-06-07-071657 because &{%!e(string=failed to apply resources because TrustedCABundleControllerControllerDegraded condition is set to True)}
Expected results:
A helpful error message
Additional info:
This is a clone of issue OCPBUGS-29104. The following is the description of the original issue:
—
Description of problem:
Only customers have a break-glass certificate signer.
Version-Release number of selected component (if applicable):
4.16.0
How reproducible:
Always
Steps to Reproduce:
1. Create CSR with any other signer chosen
2. Does not work
Actual results:
does not work
Expected results:
should work
Additional info:
Please review the following PR: https://github.com/openshift/csi-driver-shared-resource-operator/pull/84
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-28659. The following is the description of the original issue:
—
Description of problem:
The ValidatingAdmissionPolicy admission plugin is set in OpenShift 4.14+ kube-apiserver config, but is missing from the HyperShift config. It should be set.
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
4.15: https://github.com/openshift/hypershift/blob/release-4.15/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L293-L341
4.14: https://github.com/openshift/hypershift/blob/release-4.14/control-plane-operator/controllers/hostedcontrolplane/kas/config.go#L283-L331
Expected results:
Expect to see ValidatingAdmissionPolicy
Additional info:
Please review the following PR: https://github.com/openshift/cluster-update-keys/pull/51
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-25025. The following is the description of the original issue:
—
Description of problem:
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/aws-ebs-csi-driver/pull/245
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-24041. The following is the description of the original issue:
—
Seen in 4.15-related update CI:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False[^:]*: \(.*\)|\1 \2 \3|' | sed 's|[.]apps[.][^ /]*|.apps...|g' | sort | uniq -c | sort -n
 1 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp 52.158.160.194:443: connect: connection refused
 1 console RouteHealth_StatusError route not yet available, https://console-openshift-console.apps... returns '503 Service Unavailable'
 2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... dial tcp: lookup console-openshift-console.apps... on 172.30.0.10:53: no such host
 2 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... EOF
 8 console RouteHealth_RouteNotAdmitted console route is not admitted
16 console RouteHealth_FailedGet failed to GET route (https://console-openshift-console.apps... Get "https://console-openshift-console.apps... context deadline exceeded (Client.Timeout exceeded while awaiting headers)
For example this 4.14 to 4.15 run had:
: [bz-Management Console] clusteroperator/console should not change condition/Available
Run #0: Failed 1h25m23s
{ 1 unexpected clusteroperator state transitions during e2e test run
Nov 28 03:42:41.207 - 1s E clusteroperator/console condition/Available reason/RouteHealth_FailedGet status/False RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org): Get "https://console-openshift-console.apps.ci-op-d2qsp1gp-2a31d.aws-2.ci.openshift.org": context deadline exceeded (Client.Timeout exceeded while awaiting headers)}
While a timeout for the console Route isn't fantastic, an issue that only persists for 1s is not long enough to warrant immediate admin intervention. Teaching the console operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.
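The idea of tolerating brief hiccups can be sketched as a small debounce around the health probe. This is only an illustration, not the console operator's actual code: the `HealthDebouncer` class, its `observe` method, and the 60-second grace period are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthDebouncer:
    """Report Available=False only after failures persist past a grace period.

    Hypothetical helper; the grace period is an assumed value, not one taken
    from the console operator.
    """

    grace_seconds: float = 60.0
    failing_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> bool:
        """Record one probe result and return the Available status to report."""
        if healthy:
            # Any successful probe clears the failure window.
            self.failing_since = None
            return True
        if self.failing_since is None:
            # First failure in a row: remember when the outage started.
            self.failing_since = now
        # Stay Available=True until the outage outlasts the grace period.
        return (now - self.failing_since) < self.grace_seconds
```

With this shape, a 1s route timeout like the one in the CI run above would never flip the condition, while an outage longer than the grace period still would.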
At least 4.15. Possibly other versions; I haven't checked.
How reproducible:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/console.*condition/Available.*status/False' | grep 'periodic.*failures match' | sort
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 17% failed, 50% of failures match = 8% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 4 runs, 100% failed, 25% of failures match = 25% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 12 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 7 runs, 29% failed, 50% of failures match = 14% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 12 runs, 25% failed, 33% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 23% failed, 28% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 28% failed, 23% of failures match = 6% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-aws-ovn-upgrade (all) - 63 runs, 38% failed, 8% of failures match = 3% impact
periodic-ci-openshift-release-master-ci-4.16-upgrade-from-stable-4.15-e2e-azure-sdn-upgrade (all) - 60 runs, 73% failed, 11% of failures match = 8% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 70 runs, 7% failed, 20% of failures match = 1% impact
Seems like it's primarily minor-version updates that trip this, and in jobs with high run counts, the impact percentage is single-digits.
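For readers checking the impact figures in the job listing, the arithmetic appears to be the share of failed runs multiplied by the share of those failures that match the search. The helper below is a hypothetical illustration of that reading, not code from the CI search tool. Because the listing shows rounded percentages, the product only approximates the displayed impact (for example, 17% failed with 50% matching gives 8.5, shown as 8%).

```python
def impact_percent(failed_pct: float, match_pct: float) -> float:
    """Approximate job impact from the two percentages shown in the listing.

    failed_pct: percentage of runs that failed.
    match_pct: percentage of those failures that match the search.
    """
    return failed_pct * match_pct / 100.0


if __name__ == "__main__":
    # Values taken from the first row of the listing above.
    print(impact_percent(17, 50))
```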
There may be a way to reliably trigger these hiccups, but as a reproducer floor, running days of CI and checking whether the impact percentages decrease would be a good way to test fixes post-merge.
Lots of console ClusterOperator going Available=False blips in 4.15 update CI.
Console goes Available=False if and only if immediate admin intervention is appropriate.
This is a clone of issue OCPBUGS-23228. The following is the description of the original issue:
—
Release controller > 4.14.2 > HyperShift conformance run > gathered assets:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .userAgent' | sort | uniq -c
65 hosted-cluster-config-operator-manager
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.user.username == "system:admin" and .verb == "create" and .requestURI == "/apis/operator.openshift.io/v1/storages") | .requestReceivedTimestamp + " " + (.responseStatus | (.code | tostring) + " " + .reason)' | head -n5
2023-11-09T17:17:15.130454Z 409 AlreadyExists
2023-11-09T17:17:15.163256Z 409 AlreadyExists
2023-11-09T17:17:15.198908Z 409 AlreadyExists
2023-11-09T17:17:15.230532Z 409 AlreadyExists
2023-11-09T17:17:22.899579Z 409 AlreadyExists
That's banging away pretty hard with creation attempts that keep getting 409ed, presumably because an earlier creation attempt succeeded. If the controller needs very quick latency in re-creation, perhaps an informing watch? If the controller can handle some re-creation latency, perhaps a quieter poll?
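One way to picture a quieter approach is to check for the object before creating it, and to treat a racing AlreadyExists as success. The sketch below uses an in-memory stand-in for the API server; `FakeAPI`, `AlreadyExists`, and `ensure_exists` are hypothetical names for illustration and do not reflect the hosted-cluster-config-operator code.

```python
class AlreadyExists(Exception):
    """Stand-in for an HTTP 409 AlreadyExists response."""


class FakeAPI:
    """Minimal in-memory API used only to demonstrate the call pattern."""

    def __init__(self):
        self.objects = {}
        self.create_calls = 0

    def get(self, name):
        return self.objects.get(name)

    def create(self, name, spec):
        self.create_calls += 1
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = spec


def ensure_exists(api, name, spec):
    """Create the object only when a read shows it is missing.

    Returns True if this call created the object, False otherwise.
    """
    if api.get(name) is not None:
        return False  # already present: no create request is sent
    try:
        api.create(name, spec)
    except AlreadyExists:
        return False  # lost a race with another creator, which is fine
    return True
```

In a real controller the read would normally be served from an informer cache rather than a live GET, so repeated reconciles would add no load on the API server.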
4.14.2. I haven't checked other releases.
Likely 100%. I saw similar behavior in an unrelated dump, and confirmed the busy 409s in the first CI run I checked.
1. Dump a hosted cluster.
2. Inspect its audit logs for hosted-cluster-config-operator-manager create activity.
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-hypershift-release-4.14-periodics-e2e-aws-ovn-conformance/1722648207965556736/artifacts/e2e-aws-ovn-conformance/dump/artifacts/namespaces/clusters-e8d2a8003773eacb6a8b/core/pods/logs/kube-apiserver-5f47c7b667-42h2f-audit-logs.log | grep -v '/var/log/kube-apiserver/audit.log. has' | jq -r 'select(.userAgent == "hosted-cluster-config-operator-manager" and .verb == "create") | .verb + " " + (.responseStatus.code | tostring)' | sort | uniq -c
130 create 409
Zero or rare 409 creation requests from this user-agent.
The user agent seems to be defined here, so likely the fix will involve changes to that manager.
Description of problem:
When setting up transient mounts, which are used for exposing CA certificates and RPM package repositories to a build, a recent change we made in the builder attempted to replace simple bind mounts with overlay mounts. While this might have made things easier for unprivileged builds, we overlooked that overlay mounts can't be made to files, only directories, so we need to revert the change.
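The constraint described above (overlay mounts need a directory, while a single file such as redhat.repo needs a bind mount) can be illustrated with a small check. This is a hypothetical sketch: `transient_mount_type` is not a function from the builder, and the actual fix referenced in this issue reverts to bind mounts rather than choosing per source.

```python
import os
import tempfile


def transient_mount_type(source: str) -> str:
    """Pick a mount type that is valid for the given source path.

    Overlay mounts only work on directories, so anything else falls back to
    a plain bind mount.
    """
    if os.path.isdir(source):
        return "overlay"
    return "bind"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        repo_file = os.path.join(workdir, "redhat.repo")
        open(repo_file, "w").close()
        print(transient_mount_type(workdir))
        print(transient_mount_type(repo_file))
```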
Version-Release number of selected component (if applicable):
4.14.0
How reproducible:
Always
Steps to Reproduce:
Per https://redhat-internal.slack.com/archives/C014MHHKUSF/p1696882408656359?thread_ts=1696882334.352129&cid=C014MHHKUSF:
1. oc new-app -l app=pvg-nodejs --name pvg-nodejs pvg-nodejs https://github.com/openshift/nodejs-ex.git
Actual results:
mount /var/lib/containers/storage/overlay-containers/9c3877f3062cc18b01f30db310e0e2bd0a1cd4527d74f41c313399e48fa81d23/userdata/overlay/145259665/merge:/run/secrets/redhat.repo (via /proc/self/fd/6), data: lowerdir=/tmp/redhat.repo-copy2014834134/redhat.repo,upperdir=/var/lib/containers/storage/overlay-containers/9c3877f3062cc18b01f30db310e0e2bd0a1cd4527d74f41c313399e48fa81d23/userdata/overlay/145259665/upper,workdir=/var/lib/containers/storage/overlay-containers/9c3877f3062cc18b01f30db310e0e2bd0a1cd4527d74f41c313399e48fa81d23/userdata/overlay/145259665/work: *invalid argument*"
Expected results:
Successful setup for a transient mount to the redhat.repo file for a RUN instruction.
Additional info:
Bug introduced in https://github.com/openshift/builder/pull/349, should be fixed in https://github.com/openshift/builder/pull/359.
Description of problem:
Picked up 4.14-ec-4 (which uses cgroups v1 as default) and tried to create a cluster with the following PerformanceProfile (and corresponding MCP) by placing them in the manifests folder,
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
  name: clusterbotpp
spec:
  cpu:
    isolated: "1-3"
    reserved: "0"
  realTimeKernel:
    enabled: false
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  machineConfigPoolSelector:
    pools.operator.machineconfiguration.openshift.io/worker: ""
and,
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
The cluster often fails to install because bootkube spends a lot of time chasing this error,
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: [#1717] failed to create some manifests:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: "clusterbotpp_kubeletconfig.yaml": failed to update status for kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 11f98d74-af1b-4a4c-9692-6dce56ee5cd9, UID in object meta:
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Created "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n
Sep 06 18:32:43 ip-10-0-145-107 bootkube.sh[4925]: Failed to update status for the "clusterbotpp_kubeletconfig.yaml" kubeletconfigs.v1.machineconfiguration.openshift.io/performance-clusterbotpp -n : Operation cannot be fulfilled on kubeletconfigs.machineconfiguration.openshift.io "performance-clusterbotpp": StorageError: invalid object, Code: 4, Key: /kubernetes.io/machineconfiguration.openshift.io/kubeletconfigs/performance-clusterbotpp, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 597dfcf3-012d-4730-912a-78efabb920ba, UID in object meta:
This leads to worker nodes not becoming ready in time, which leads to the installer marking the cluster installation as failed. Ironically, even after the installer returns with a failure, I have observed that if you wait long enough the cluster sometimes eventually reconciles and the worker nodes get provisioned.
I am attaching the installation logs from one such run with this issue.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Often
Steps to Reproduce:
1. Try to install a new cluster by placing a PerformanceProfile in the manifests folder
Actual results:
Cluster installation failed.
Expected results:
Cluster installation should succeed.
Additional info:
Also, I didn't observe this occurring in 4.13.9.
Description of problem:
Failed to install cluster on SC2S region as: level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-11-201102
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster on SC2S
Actual results:
Installation fails: level=error msg=Error: reading Security Group (sg-0b0cd054dd599602f) Rules: UnsupportedOperation: The functionality you requested is not available in this region.
Expected results:
Installation succeeds.
Additional info:
* C2S region is not affected
Reduce shared informer memory usage by stripping object fields we don't care about.
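A minimal sketch of the field-stripping idea follows, written against plain dictionaries rather than a real informer. Which fields are safe to drop is an assumption here: `managedFields` and the `kubectl.kubernetes.io/last-applied-configuration` annotation are common candidates, but the source does not say which fields the actual change removes, and `strip_unused_fields` is a hypothetical name.

```python
import copy

# Annotation that often carries a full copy of the object (assumed droppable).
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"


def strip_unused_fields(obj: dict) -> dict:
    """Return a slimmer copy of a Kubernetes-style object for caching.

    The input is left untouched so callers can still use the full object.
    """
    slim = copy.deepcopy(obj)
    metadata = slim.get("metadata") or {}
    # managedFields is bookkeeping for server-side apply and is usually large.
    metadata.pop("managedFields", None)
    annotations = metadata.get("annotations") or {}
    annotations.pop(LAST_APPLIED, None)
    return slim
```

In Go-based controllers the equivalent hook would be a transform applied before objects enter the informer store, so the stripped data is never held in memory.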
Description of problem:
After build02 is upgraded to 4.16.0-ec.4 from 4.16.0-ec.3, the CSRs are not auto-approved. As a result, provisioned machines cannot become nodes of the cluster.
Version-Release number of selected component (if applicable):
oc --context build02 get clusterversion version
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-ec.4   True        False         4h28m
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Michael McCune feels the group "system:serviceaccounts" was missing in the CSR.
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710875084740869?thread_ts=1710861842.471739&cid=CBZHF4DHC
An inspection of the namespace openshift-cluster-machine-approver:
https://redhat-internal.slack.com/archives/CBZHF4DHC/p1710863462860809?thread_ts=1710861842.471739&cid=CBZHF4DHC
A workaround to approve the CSRs manually on b02:
https://github.com/openshift/release/pull/50016
Please review the following PR: https://github.com/openshift/cloud-provider-gcp/pull/47
Please review the following PR: https://github.com/openshift/machine-os-images/pull/34
Description of problem:
From profiling on cert rotation we know that the node informer is called every couple of seconds on node heartbeats. This PR will ensure that all our node listers only ever listen/inform on master node updates, to reduce the frequency of unnecessary sync calls. This is also related to the issues about the increased number of node status updates: OCPBUGS-29713, OCPBUGS-29424.
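The effect of restricting node handling to master nodes can be sketched as a filter on incoming node updates. This is an illustration only: `is_master` and `master_updates` are hypothetical names, nodes are modelled as plain dictionaries, and the label key `node-role.kubernetes.io/master` is assumed to be the selector in use.

```python
MASTER_LABEL = "node-role.kubernetes.io/master"


def is_master(node: dict) -> bool:
    """True when the node carries the master role label."""
    labels = (node.get("metadata") or {}).get("labels") or {}
    return MASTER_LABEL in labels


def master_updates(nodes: list) -> list:
    """Keep only updates for master nodes so worker heartbeats trigger no sync."""
    return [node for node in nodes if is_master(node)]
```

A real informer would normally apply this as a label selector on the list/watch call, so that worker node heartbeats are filtered by the API server and never reach the operator at all.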
Version-Release number of selected component (if applicable):
4.16 down to 4.12, we need to check all versions
How reproducible:
always
Steps to Reproduce:
1. Create a cluster.
2. Look at some metric (e.g. sum(rate(apiserver_request_total{resource="nodes"}[5m]))).
3. Observe some improvement over the previous state.
Actual results:
increased amount of CPU usage for CEO / QPS to apiserver
Expected results:
lower CPU usage for CEO / QPS to apiserver
Additional info:
Already fixed in 4.16 with https://github.com/openshift/cluster-etcd-operator/pull/1205; this ticket was created for backporting.
Description of problem:
Installing an IPv6 agent-based hosted cluster in a disconnected environment. The hosted control plane is available, but when using its kubeconfig to run oc commands against the hosted cluster, I get:
E1009 08:05:34.000946 115216 memcache.go:265] couldn't get current server API group list: Get "https://fd2e:6f44:5dd8::58:31765/api?timeout=32s": dial tcp [fd2e:6f44:5dd8::58]:31765: i/o timeout
Version-Release number of selected component (if applicable):
OCP 4.14.0-rc.4
How reproducible:
100%
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
I can use oc commands against the hosted cluster
Additional info:
Description of problem:
Enable IPSec pre/post install on OVN IC cluster
$ oc patch networks.operator.openshift.io cluster --type=merge -p '{"spec":{"defaultNetwork":{"ovnKubernetesConfig":{"ipsecConfig":{ }}}}}'
network.operator.openshift.io/cluster patched
ovn-ipsec containers complaining:
ovs-monitor-ipsec | ERR | Failed to import certificate into NSS. b'certutil: unable to open "/etc/openvswitch/keys/ipsec-cacert.pem" for reading (-5950, 2).\n'
$ oc rsh ovn-ipsec-d7rx9
Defaulted container "ovn-ipsec" out of: ovn-ipsec, ovn-keys (init)
sh-5.1# certutil -L -d /var/lib/ipsec/nss
Certificate Nickname                                 Trust Attributes
                                                     SSL,S/MIME,JAR/XPI
ovs_certkey_db961f9a-7de4-4f1d-a2fb-a8306d4079c5     u,u,u
sh-5.1# cat /var/log/openvswitch/libreswan.log
Aug 4 15:12:46.808394: Initializing NSS using read-write database "sql:/var/lib/ipsec/nss"
Aug 4 15:12:46.837350: FIPS Mode: NO
Aug 4 15:12:46.837370: NSS crypto library initialized
Aug 4 15:12:46.837387: FIPS mode disabled for pluto daemon
Aug 4 15:12:46.837390: FIPS HMAC integrity support [disabled]
Aug 4 15:12:46.837541: libcap-ng support [enabled]
Aug 4 15:12:46.837550: Linux audit support [enabled]
Aug 4 15:12:46.837576: Linux audit activated
Aug 4 15:12:46.837580: Starting Pluto (Libreswan Version 4.9 IKEv2 IKEv1 XFRM XFRMI esp-hw-offload FORK PTHREAD_SETSCHEDPRIO GCC_EXCEPTIONS NSS (IPsec profile) (NSS-KDF) DNSSEC SYSTEMD_WATCHDOG LABELED_IPSEC (SELINUX) SECCOMP LIBCAP_NG LINUX_AUDIT AUTH_PAM NETWORKMANAGER CURL(non-NSS) LDAP(non-NSS)) pid:147
Aug 4 15:12:46.837583: core dump dir: /run/pluto
Aug 4 15:12:46.837585: secrets file: /etc/ipsec.secrets
Aug 4 15:12:46.837587: leak-detective enabled
Aug 4 15:12:46.837589: NSS crypto [enabled]
Aug 4 15:12:46.837591: XAUTH PAM support [enabled]
Aug 4 15:12:46.837604: initializing libevent in pthreads mode: headers: 2.1.12-stable (2010c00); library: 2.1.12-stable (2010c00)
Aug 4 15:12:46.837664: NAT-Traversal support [enabled]
Aug 4 15:12:46.837803: Encryption algorithms:
Aug 4 15:12:46.837814: AES_CCM_16
{256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm, aes_ccm_c Aug 4 15:12:46.837820: AES_CCM_12 {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm_b Aug 4 15:12:46.837826: AES_CCM_8 {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_ccm_a Aug 4 15:12:46.837831: 3DES_CBC [*192] IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CBC) 3des Aug 4 15:12:46.837837: CAMELLIA_CTR {256,192,*128} IKEv1: ESP IKEv2: ESP Aug 4 15:12:46.837843: CAMELLIA_CBC {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP NSS(CBC) camellia Aug 4 15:12:46.837849: AES_GCM_16 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm, aes_gcm_c Aug 4 15:12:46.837855: AES_GCM_12 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm_b Aug 4 15:12:46.837861: AES_GCM_8 {256,192,*128} IKEv1: ESP IKEv2: IKE ESP FIPS NSS(GCM) aes_gcm_a Aug 4 15:12:46.837867: AES_CTR {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CTR) aesctr Aug 4 15:12:46.837872: AES_CBC {256,192,*128} IKEv1: IKE ESP IKEv2: IKE ESP FIPS NSS(CBC) aes Aug 4 15:12:46.837878: NULL_AUTH_AES_GMAC {256,192,*128} IKEv1: ESP IKEv2: ESP FIPS aes_gmac Aug 4 15:12:46.837883: NULL [] IKEv1: ESP IKEv2: ESP Aug 4 15:12:46.837889: CHACHA20_POLY1305 [*256] IKEv1: IKEv2: IKE ESP NSS(AEAD) chacha20poly1305 Aug 4 15:12:46.837892: Hash algorithms: Aug 4 15:12:46.837896: MD5 IKEv1: IKE IKEv2: NSS Aug 4 15:12:46.837901: SHA1 IKEv1: IKE IKEv2: IKE FIPS NSS sha Aug 4 15:12:46.837906: SHA2_256 IKEv1: IKE IKEv2: IKE FIPS NSS sha2, sha256 Aug 4 15:12:46.837910: SHA2_384 IKEv1: IKE IKEv2: IKE FIPS NSS sha384 Aug 4 15:12:46.837915: SHA2_512 IKEv1: IKE IKEv2: IKE FIPS NSS sha512 Aug 4 15:12:46.837919: IDENTITY IKEv1: IKEv2: FIPS Aug 4 15:12:46.837922: PRF algorithms: Aug 4 15:12:46.837927: HMAC_MD5 IKEv1: IKE IKEv2: IKE native(HMAC) md5 Aug 4 15:12:46.837931: HMAC_SHA1 IKEv1: IKE IKEv2: IKE FIPS NSS sha, sha1 Aug 4 15:12:46.837936: HMAC_SHA2_256 IKEv1: IKE IKEv2: IKE FIPS NSS sha2, sha256, sha2_256 Aug 4 15:12:46.837950: HMAC_SHA2_384 IKEv1: IKE IKEv2: IKE FIPS NSS 
sha384, sha2_384 Aug 4 15:12:46.837955: HMAC_SHA2_512 IKEv1: IKE IKEv2: IKE FIPS NSS sha512, sha2_512 Aug 4 15:12:46.837959: AES_XCBC IKEv1: IKEv2: IKE native(XCBC) aes128_xcbc Aug 4 15:12:46.837962: Integrity algorithms: Aug 4 15:12:46.837966: HMAC_MD5_96 IKEv1: IKE ESP AH IKEv2: IKE ESP AH native(HMAC) md5, hmac_md5 Aug 4 15:12:46.837984: HMAC_SHA1_96 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha, sha1, sha1_96, hmac_sha1 Aug 4 15:12:46.837995: HMAC_SHA2_512_256 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha512, sha2_512, sha2_512_256, hmac_sha2_512 Aug 4 15:12:46.837999: HMAC_SHA2_384_192 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha384, sha2_384, sha2_384_192, hmac_sha2_384 Aug 4 15:12:46.838005: HMAC_SHA2_256_128 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS sha2, sha256, sha2_256, sha2_256_128, hmac_sha2_256 Aug 4 15:12:46.838008: HMAC_SHA2_256_TRUNCBUG IKEv1: ESP AH IKEv2: AH Aug 4 15:12:46.838014: AES_XCBC_96 IKEv1: ESP AH IKEv2: IKE ESP AH native(XCBC) aes_xcbc, aes128_xcbc, aes128_xcbc_96 Aug 4 15:12:46.838018: AES_CMAC_96 IKEv1: ESP AH IKEv2: ESP AH FIPS aes_cmac Aug 4 15:12:46.838023: NONE IKEv1: ESP IKEv2: IKE ESP FIPS null Aug 4 15:12:46.838026: DH algorithms: Aug 4 15:12:46.838031: NONE IKEv1: IKEv2: IKE ESP AH FIPS NSS(MODP) null, dh0 Aug 4 15:12:46.838035: MODP1536 IKEv1: IKE ESP AH IKEv2: IKE ESP AH NSS(MODP) dh5 Aug 4 15:12:46.838039: MODP2048 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh14 Aug 4 15:12:46.838044: MODP3072 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh15 Aug 4 15:12:46.838048: MODP4096 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh16 Aug 4 15:12:46.838053: MODP6144 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh17 Aug 4 15:12:46.838057: MODP8192 IKEv1: IKE ESP AH IKEv2: IKE ESP AH FIPS NSS(MODP) dh18 Aug 4 15:12:46.838061: DH19 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_256, ecp256 Aug 4 15:12:46.838066: DH20 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_384, ecp384 Aug 4 15:12:46.838070: 
DH21 IKEv1: IKE IKEv2: IKE ESP AH FIPS NSS(ECP) ecp_521, ecp521 Aug 4 15:12:46.838074: DH31 IKEv1: IKE IKEv2: IKE ESP AH NSS(ECP) curve25519 Aug 4 15:12:46.838077: IPCOMP algorithms: Aug 4 15:12:46.838081: DEFLATE IKEv1: ESP AH IKEv2: ESP AH FIPS Aug 4 15:12:46.838085: LZS IKEv1: IKEv2: ESP AH FIPS Aug 4 15:12:46.838089: LZJH IKEv1: IKEv2: ESP AH FIPS Aug 4 15:12:46.838093: testing CAMELLIA_CBC: Aug 4 15:12:46.838096: Camellia: 16 bytes with 128-bit key Aug 4 15:12:46.838162: Camellia: 16 bytes with 128-bit key Aug 4 15:12:46.838201: Camellia: 16 bytes with 256-bit key Aug 4 15:12:46.838243: Camellia: 16 bytes with 256-bit key Aug 4 15:12:46.838280: testing AES_GCM_16: Aug 4 15:12:46.838284: empty string Aug 4 15:12:46.838319: one block Aug 4 15:12:46.838352: two blocks Aug 4 15:12:46.838385: two blocks with associated data Aug 4 15:12:46.838424: testing AES_CTR: Aug 4 15:12:46.838428: Encrypting 16 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838464: Encrypting 32 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838502: Encrypting 36 octets using AES-CTR with 128-bit key Aug 4 15:12:46.838541: Encrypting 16 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838576: Encrypting 32 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838613: Encrypting 36 octets using AES-CTR with 192-bit key Aug 4 15:12:46.838651: Encrypting 16 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838687: Encrypting 32 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838724: Encrypting 36 octets using AES-CTR with 256-bit key Aug 4 15:12:46.838763: testing AES_CBC: Aug 4 15:12:46.838766: Encrypting 16 bytes (1 block) using AES-CBC with 128-bit key Aug 4 15:12:46.838801: Encrypting 32 bytes (2 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838841: Encrypting 48 bytes (3 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838881: Encrypting 64 bytes (4 blocks) using AES-CBC with 128-bit key Aug 4 15:12:46.838928: testing AES_XCBC: Aug 4 15:12:46.838932: RFC 
3566 Test Case 1: AES-XCBC-MAC-96 with 0-byte input Aug 4 15:12:46.839126: RFC 3566 Test Case 2: AES-XCBC-MAC-96 with 3-byte input Aug 4 15:12:46.839291: RFC 3566 Test Case 3: AES-XCBC-MAC-96 with 16-byte input Aug 4 15:12:46.839444: RFC 3566 Test Case 4: AES-XCBC-MAC-96 with 20-byte input Aug 4 15:12:46.839600: RFC 3566 Test Case 5: AES-XCBC-MAC-96 with 32-byte input Aug 4 15:12:46.839756: RFC 3566 Test Case 6: AES-XCBC-MAC-96 with 34-byte input Aug 4 15:12:46.839937: RFC 3566 Test Case 7: AES-XCBC-MAC-96 with 1000-byte input Aug 4 15:12:46.840373: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 16) Aug 4 15:12:46.840529: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 10) Aug 4 15:12:46.840698: RFC 4434 Test Case AES-XCBC-PRF-128 with 20-byte input (key length 18) Aug 4 15:12:46.840990: testing HMAC_MD5: Aug 4 15:12:46.840997: RFC 2104: MD5_HMAC test 1 Aug 4 15:12:46.841200: RFC 2104: MD5_HMAC test 2 Aug 4 15:12:46.841390: RFC 2104: MD5_HMAC test 3 Aug 4 15:12:46.841582: testing HMAC_SHA1: Aug 4 15:12:46.841585: CAVP: IKEv2 key derivation with HMAC-SHA1 Aug 4 15:12:46.842055: 8 CPU cores online Aug 4 15:12:46.842062: starting up 7 helper threads Aug 4 15:12:46.842128: started thread for helper 0 Aug 4 15:12:46.842174: helper(1) seccomp security disabled for crypto helper 1 Aug 4 15:12:46.842188: started thread for helper 1 Aug 4 15:12:46.842219: helper(2) seccomp security disabled for crypto helper 2 Aug 4 15:12:46.842236: started thread for helper 2 Aug 4 15:12:46.842258: helper(3) seccomp security disabled for crypto helper 3 Aug 4 15:12:46.842269: started thread for helper 3 Aug 4 15:12:46.842296: helper(4) seccomp security disabled for crypto helper 4 Aug 4 15:12:46.842311: started thread for helper 4 Aug 4 15:12:46.842323: helper(5) seccomp security disabled for crypto helper 5 Aug 4 15:12:46.842346: started thread for helper 5 Aug 4 15:12:46.842369: helper(6) seccomp security disabled for crypto helper 6 Aug 4 
15:12:46.842376: started thread for helper 6
Aug 4 15:12:46.842390: using Linux xfrm kernel support code on #1 SMP PREEMPT_DYNAMIC Thu Jul 20 09:11:28 EDT 2023
Aug 4 15:12:46.842393: helper(7) seccomp security disabled for crypto helper 7
Aug 4 15:12:46.842707: selinux support is NOT enabled.
Aug 4 15:12:46.842728: systemd watchdog not enabled - not sending watchdog keepalives
Aug 4 15:12:46.843813: seccomp security disabled
Aug 4 15:12:46.848083: listening for IKE messages
Aug 4 15:12:46.848252: Kernel supports NIC esp-hw-offload
Aug 4 15:12:46.848534: adding UDP interface ovn-k8s-mp0 10.129.0.2:500
Aug 4 15:12:46.848624: adding UDP interface ovn-k8s-mp0 10.129.0.2:4500
Aug 4 15:12:46.848654: adding UDP interface br-ex 169.254.169.2:500
Aug 4 15:12:46.848681: adding UDP interface br-ex 169.254.169.2:4500
Aug 4 15:12:46.848713: adding UDP interface br-ex 10.0.0.8:500
Aug 4 15:12:46.848740: adding UDP interface br-ex 10.0.0.8:4500
Aug 4 15:12:46.848767: adding UDP interface lo 127.0.0.1:500
Aug 4 15:12:46.848793: adding UDP interface lo 127.0.0.1:4500
Aug 4 15:12:46.848824: adding UDP interface lo [::1]:500
Aug 4 15:12:46.848853: adding UDP interface lo [::1]:4500
Aug 4 15:12:46.851160: loading secrets from "/etc/ipsec.secrets"
Aug 4 15:12:46.851214: no secrets filename matched "/etc/ipsec.d/*.secrets"
Aug 4 15:12:47.053369: loading secrets from "/etc/ipsec.secrets"
sh-4.4# tcpdump -i any esp
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
^C
0 packets captured
sh-5.1# ovn-nbctl --no-leader-only get nb_global . ipsec
false
Version-Release number of selected component (if applicable):
openshift/cluster-network-operator#1874
How reproducible:
Always
Steps to Reproduce:
1. Install an OVN cluster and enable IPsec at runtime
Actual results:
no esp packets seen across the nodes
Expected results:
esp traffic should be seen across the nodes
Additional info:
Description of problem:
Version-Release number of selected component (if applicable):
pre-merge testing or 4.14.0-0.nightly-2023-08-20-085537
How reproducible:
Always
Steps to Reproduce:
1. Label one worker node as an egress node and enable IP forwarding on it.
2. Create an EgressIP object; it can be assigned to the egress node:
oc get egressip
NAME         EGRESSIPS      ASSIGNED NODE                         ASSIGNED EGRESSIPS
egressip-1   172.22.0.100   worker-2.sriov.openshift-qe.sdn.com   172.22.0.100
oc get egressip -o yaml
apiVersion: v1
items:
- apiVersion: k8s.ovn.org/v1
  kind: EgressIP
  metadata:
    creationTimestamp: "2023-08-11T03:46:19Z"
    generation: 7
    name: egressip-1
    resourceVersion: "169277"
    uid: 7692bea5-c072-41e5-aa7a-acfa737a5428
  spec:
    egressIPs:
    - 172.22.0.100
    namespaceSelector:
      matchLabels:
        name: qe
  status:
    items:
    - egressIP: 172.22.0.100
      node: worker-2.sriov.openshift-qe.sdn.com
kind: List
metadata:
  resourceVersion: ""
3. Create a namespace "test" and some pods in it. Add a label to the namespace that matches the EgressIP object.
4. From a pod, access the bastion host.
Actual results:
Outgoing traffic timed out.
From the bastion node, the correct MAC address for the egressIP was not resolved:
? (172.22.0.100) at <incomplete> on sriovpr
The egressIP was not added to the secondary NIC on the egress node:
oc debug node/worker-2.sriov.openshift-qe.sdn.com
Temporary namespace openshift-debug-crpt9 is created for debugging node...
Starting pod/worker-2sriovopenshift-qesdncom-debug-s857l ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.111.25
If you don't see a command prompt, try pressing enter.
sh-4.4# ip a show enp1s0
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 00:32:ca:4e:a8:bf brd ff:ff:ff:ff:ff:ff
    inet 172.22.0.50/24 scope global enp1s0
       valid_lft forever preferred_lft forever
    inet6 fd00:1101::65fe:9a70:ab40:4c1a/128 scope global dynamic noprefixroute
       valid_lft 85269sec preferred_lft 85269sec
    inet6 fe80::232:caff:fe4e:a8bf/64 scope link noprefixroute
       valid_lft forever preferred_lft forever
Expected results:
EgressIP works well on secondary NIC
Additional info:
Description of problem:
Go to the Home -> Events page and type a string in the filter field; the events are not filtered. (The search mode is fuzzy search by default.)
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-08-28-154013
How reproducible:
Always
Steps to Reproduce:
1. Go to the Home -> Events page and type a string in the filter field.
Actual results:
1. The events are not filtered.
Expected results:
1. Only events containing the filter string should be shown.
Additional info:
The Type filter does work on the Events page.
Description of problem:
AWS KMS on HyperShift makes use of two UNIX sockets via which the KMS plugins are run. Each UNIX socket should connect to an independent KMS instance, i.e. one with its own AWS ARN. However, as of today both the active KMS socket and the backup KMS socket seem to be using the same ARN, which essentially means that the backup KMS instance never gets used.
Version-Release number of selected component (if applicable):
HyperShift - main branch (PR #423) GitHub indicates all the following hypershift versions would be affected. v0.1.15, v0.1.14, v0.1.13, v0.1.12, v0.1.11, v0.1.10, v0.1.9, v0.1.8, v0.1.7, v0.1.6, v0.1.5, v0.1.4, v0.1.3, v0.1.2, v0.1.1, v0.1.0, 2.0.0-20220406093220, 2.0.0-20220323110745, 2.0.0-20220319120001, 2.0.0-20220317155435
How reproducible:
Always
Steps to Reproduce:
1. Create a HyperShift cluster.
2. Check whether the backup KMS instance was ever used.
Actual results:
Active KMS instance's ARN is used even by the backup KMS socket
Expected results:
Backup KMS socket should use its own backupKey.ARN
Additional info:
The func call for the backup socket should use backupKey.ARN instead of activeKey.ARN.
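The intended pairing of sockets and keys can be sketched roughly as follows. This is only an illustration, not the HyperShift code: the `Key` and `pluginConfig` types, the `buildPluginConfigs` function, the socket paths, and the example ARNs are all hypothetical; only the field names `activeKey.ARN` and `backupKey.ARN` come from the report.

```go
package main

import "fmt"

// Key is a hypothetical stand-in for a KMS key reference. The ARN field
// mirrors the activeKey.ARN / backupKey.ARN names mentioned in the report.
type Key struct {
	ARN string
}

// pluginConfig is a hypothetical description of one KMS plugin instance.
type pluginConfig struct {
	SocketPath string
	KeyARN     string
}

// buildPluginConfigs pairs each UNIX socket with its own key.
// The reported bug corresponds to using activeKey.ARN in both entries.
func buildPluginConfigs(activeKey, backupKey Key) []pluginConfig {
	return []pluginConfig{
		{SocketPath: "/var/run/kms-active.sock", KeyARN: activeKey.ARN},
		{SocketPath: "/var/run/kms-backup.sock", KeyARN: backupKey.ARN},
	}
}

func main() {
	// Example ARNs are made up for illustration.
	active := Key{ARN: "arn:aws:kms:us-east-1:111111111111:key/active-example"}
	backup := Key{ARN: "arn:aws:kms:us-east-1:111111111111:key/backup-example"}
	for _, c := range buildPluginConfigs(active, backup) {
		fmt.Printf("%s -> %s\n", c.SocketPath, c.KeyARN)
	}
}
```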
Please review the following PR: https://github.com/openshift/cluster-control-plane-machine-set-operator/pull/266
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This downstream PR is failing continuously on the Image Update Test. The goal of this task is to identify the root cause and fix it.
Description of problem:
Facing error while creating manifests:
./openshift-install create manifests --dir openshift-config
FATAL failed to fetch Master Machines: failed to generate asset "Master Machines": failed to create master machine objects: failed to create provider: unexpected end of JSON input
Using the following document: https://docs.openshift.com/container-platform/4.14/installing/installing_gcp/installing-gcp-vpc.html#installation-gcp-config-yaml_installing-gcp-vpc
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
This is a clone of issue OCPBUGS-23744. The following is the description of the original issue:
—
Seen in 4.14 to 4.15 update CI:
[bz-OLM] clusteroperator/operator-lifecycle-manager-packageserver should not change condition/Available
Run #0: Failed 1h34m55s
{ 1 unexpected clusteroperator state transitions during e2e test run
Nov 22 21:48:41.624 - 56ms E clusteroperator/operator-lifecycle-manager-packageserver condition/Available reason/ClusterServiceVersionNotSucceeded status/False ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver observed in phase Failed with reason: APIServiceInstallFailed, message: APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"}
While a brief auth failure isn't fantastic, an issue that only persists for 56ms is not long enough to warrant immediate admin intervention. Teaching the operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required. It's also possible that this is an incoming-RBAC vs. outgoing-RBAC race of some sort, and that shifting manifest filenames around could avoid the hiccup entirely.
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/operator-lifecycle-manager-packageserver+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-aws-ovn-arm64 (all) - 8 runs, 38% failed, 33% of failures match = 13% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 20% failed, 400% of failures match = 80% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-ppc64le (all) - 6 runs, 67% failed, 75% of failures match = 50% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-sdn-arm64 (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-arm64 (all) - 5 runs, 40% failed, 100% of failures match = 40% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 43 runs, 51% failed, 36% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 300% of failures match = 60% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 17% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 63% of failures match = 19% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-uwm (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 8 runs, 25% failed, 200% of failures match = 50% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 50% of failures match = 21% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 50 runs, 16% failed, 50% of failures match = 8% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-vsphere-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-from-stable-4.13-e2e-aws-sdn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 80% of failures match = 80% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-upgrade-rollback-oldest-supported (all) - 4 runs, 25% failed, 100% of failures match = 25% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade (all) - 50 runs, 18% failed, 178% of failures match = 32% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-sdn-bm-upgrade (all) - 6 runs, 83% failed, 20% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 83% failed, 60% of failures match = 50% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.13-e2e-aws-ovn-upgrade-paused (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 6 runs, 17% failed, 100% of failures match = 17% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 6 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-release-master-okd-4.15-e2e-aws-ovn-upgrade (all) - 19 runs, 63% failed, 33% of failures match = 21% impact
periodic-ci-openshift-release-master-okd-scos-4.15-e2e-aws-ovn-upgrade (all) - 15 runs, 47% failed, 57% of failures match = 27% impact
I'm not sure if all of those are from this system:anonymous issue, or if some of them are other mechanisms. Ideally we fix all of the Available=False noise, while, again, still going Available=False when it is worth summoning an admin immediately. Checking for different reason and message strings in recent 4.15-touching update runs:
$ curl -s 'https://search.ci.openshift.org/search?maxAge=48h&type=junit&name=4.15.*upgrade&context=0&search=clusteroperator/operator-lifecycle-manager-packageserver.*condition/Available.*status/False' | jq -r 'to_entries[].value | to_entries[].value[].context[]' | sed 's|.*clusteroperator/\([^ ]*\) condition/Available reason/\([^ ]*\) status/False.*message: \(.*\)|\1 \2 \3|' | sort | uniq -c | sort -n
3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: Unauthorized
3 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install timeout
4 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: Operation cannot be fulfilled on apiservices.apiregistration.k8s.io "v1.packages.operators.coreos.com": the object has been modified; please apply your changes to the latest version and try again
9 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded apiServices not installed
23 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded install strategy failed: could not create service packageserver-service: services "packageserver-service" already exists
82 operator-lifecycle-manager-packageserver ClusterServiceVersionNotSucceeded APIService install failed: forbidden: User "system:anonymous" cannot get path "/apis/packages.operators.coreos.com/v1"
Lots of hits in the above CI search. Running one of the 100% impact flavors has a good chance at reproducing.
1. Install 4.14
2. Update to 4.15
3. Keep an eye on operator-lifecycle-manager-packageserver's ClusterOperator Available.
Available=False blips.
Available=True the whole time, or any Available=False looks like a serious issue where summoning an admin would have been appropriate.
Causes also these testcases to fail (mentioning them here for Sippy to link here on relevant component readiness failures):
Please review the following PR: https://github.com/openshift/images/pull/151
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This was originally reported in AWS (details below), but the OpenStack configuration suffers the same issue. If the metadata query for the instance name fails on initial boot, kubelet will start with an invalid nodename and will fail to come up.
Description of problem:
Worker CSRs are pending, so no worker nodes are available.
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-10-06-234925
How reproducible:
Always
Steps to Reproduce:
Create a cluster with profile - aws-c2s-ipi-disconnected-private-fips
Actual results:
Worker CSRs are pending
Expected results:
Workers should be up and running, with all CSRs approved
Additional info:
The cluster-machine-approver logs show "failed to find machine for node ip-10-143-1-120". It seems we should have names like "ip-10-143-1-120.ec2.internal". The check fails here: https://github.com/openshift/cluster-machine-approver/blob/master/pkg/controller/csr_check.go#L263
Must-gather - https://drive.google.com/file/d/15tz9TLdTXrH6bSBSfhlIJ1l_nzeFE1R3/view?usp=sharing
template for installation - https://gitlab.cee.redhat.com/aosqe/flexy-templates/-/blob/master/functionality-testing/aos-4_14/ipi-on-aws/versioned-installer-customer_vpc-disconnected_private_cluster-fips-c2s-ci
Description of problem:
As a part of the forbidden node label e2e test, we execute `oc debug` command to set the forbidden labels on the node. The `oc debug` command is expected to fail while applying the forbidden label.
In our testing, we observed that even though the actual command on the node (kubectl label node/<node> <forbidden_label>) expectedly fails, the `oc debug` command does not carry the return code correctly (it will return 0, even though `kubectl label` fails with error).
Version-Release number of selected component (if applicable):
4.14
How reproducible:
flaky
Steps to Reproduce:
1. Run the test at https://gist.github.com/harche/c9143c382cfe94d7836414d5ccc0ba45
2. Observe that sometimes it flakes at https://gist.github.com/harche/c9143c382cfe94d7836414d5ccc0ba45#file-test-go-L39
Actual results:
oc debug return value flakes
Expected results:
oc debug return value should be consistent.
Additional info:
Please review the following PR: https://github.com/openshift/machine-api-provider-nutanix/pull/58
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Attempting to destroy an AWS cluster can result in an error such as:
2023-11-21T15:04:15Z INFO Deleted role {"role": "53375835bafc21240c89-mgmt-worker-role"}
2023-11-21T15:04:15Z INFO Deleting Secrets {"namespace": "clusters"}
2023-11-21T15:04:15Z INFO Deleted CLI generated secrets
2023-11-21T15:04:15Z ERROR Failed to destroy cluster {"error": "failed to remove finalizer: HostedCluster.hypershift.openshift.io \"53375835bafc21240c89-mgmt\" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{\"hypershift.io/aws-oidc-discovery\"}"}
github.com/spf13/cobra.(*Command).execute
    /hypershift/vendor/github.com/spf13/cobra/command.go:916
github.com/spf13/cobra.(*Command).ExecuteC
    /hypershift/vendor/github.com/spf13/cobra/command.go:1044
github.com/spf13/cobra.(*Command).Execute
    /hypershift/vendor/github.com/spf13/cobra/command.go:968
github.com/spf13/cobra.(*Command).ExecuteContext
    /hypershift/vendor/github.com/spf13/cobra/command.go:961
main.main
    /hypershift/main.go:70
runtime.main
    /usr/local/go/src/runtime/proc.go:250
Error: failed to remove finalizer: HostedCluster.hypershift.openshift.io "53375835bafc21240c89-mgmt" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"hypershift.io/aws-oidc-discovery"}
failed to remove finalizer: HostedCluster.hypershift.openshift.io "53375835bafc21240c89-mgmt" is invalid: metadata.finalizers: Forbidden: no new finalizers can be added if the object is being deleted, found new finalizers []string{"hypershift.io/aws-oidc-discovery"}
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Occasionally
Steps to Reproduce:
1. Create a hosted AWS cluster.
2. Destroy the cluster with `hypershift destroy cluster aws`.
Actual results:
In some cases, the destroy will fail with the message in the description
Expected results:
The destroy does not fail while removing the destroy finalizer
Additional info:
Please review the following PR: https://github.com/openshift/kube-state-metrics/pull/105
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
This is a clone of issue OCPBUGS-26566. The following is the description of the original issue:
—
Description of problem:
When the user clicks 'Cancel' on any Secret creation page, the page doesn't return to the Secrets list page.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-06-062415
How reproducible:
Always
Steps to Reproduce:
1. Go to the Create Key/value secret | Image pull secret | Source secret | Webhook secret | FromYaml page, e.g. /k8s/ns/default/secrets/~new/generic
2. Click the Cancel button.
Actual results:
The page does not go back to the Secrets list page, e.g. /k8s/ns/default/core~v1~Secret
Expected results:
The page should go back to the Secrets list page
Additional info:
Description of problem:
Business Automation Operands fail to load in the uninstall operator modal, which shows the alert message "Cannot load Operands. There was an error loading operands for this operator. Operands will need to be deleted manually...". The "Delete all operand instances for this operator__checkbox" is not shown, so the test fails.
https://search.ci.openshift.org/?search=Testing+uninstall+of+Business+Automation+Operator&maxAge=168h&context=1&type=junit&name=pull-ci-openshift-console-master-e2e-gcp-console&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
Please review the following PR: https://github.com/openshift/bond-cni/pull/59
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
https://github.com/openshift/installer/pull/7778 introduced a bug where an error is always returned while retrieving a marketplace image.
Version-Release number of selected component (if applicable):
How reproducible:
always
Steps to Reproduce:
1. Configure a marketplace image in the install-config.
2. Run openshift-install create manifests.
Actual results:
$ ./openshift-install create manifests --dir ipi1 --log-level debug
DEBUG OpenShift Installer 4.16.0-0.test-2023-12-12-020559-ci-ln-xkqmlqk-latest
DEBUG Built from commit 456ae720a83e39dffd9918c5a71388ad873b6a38
DEBUG Fetching Master Machines...
DEBUG Loading Master Machines...
DEBUG Loading Cluster ID...
DEBUG Loading Install Config...
DEBUG Loading SSH Key...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Cluster Name...
DEBUG Loading Base Domain...
DEBUG Loading Platform...
DEBUG Loading Pull Secret...
DEBUG Loading Platform...
INFO Credentials loaded from file "/home/fedora/.azure/osServicePrincipal.json"
ERROR failed to fetch Master Machines: failed to load asset "Install Config": failed to create install config: [controlPlane.platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>), compute[0].platform.azure.osImage: Invalid value: azure.OSImage{Plan:"", Publisher:"redhat", Offer:"rh-ocp-worker", SKU:"rh-ocp-worker", Version:"413.92.2023101700"}: could not get marketplace image: %!w(<nil>)]
Expected results:
Success
Additional info:
When `errors.Wrap(err, ...)` was replaced by `fmt.Errorf(...)`, the behavior changed slightly: `errors.Wrap` returns `nil` if `err` is `nil`, but `fmt.Errorf` always returns a non-nil error.
This is a clone of issue OCPBUGS-16814. The following is the description of the original issue:
—
Description of problem:
Starting with OpenShift 4.8 (https://docs.openshift.com/container-platform/4.8/release_notes/ocp-4-8-release-notes.html#ocp-4-8-notable-technical-changes), all pods get bound SA tokens. Currently, instead of expiring the token, we use `service-account-extend-token-expiration`, which extends a bound token's validity to 1 year and warns when a token is used that would otherwise have expired. We want to disable this behavior in a future OpenShift release, which would break the OpenShift web console.
Version-Release number of selected component (if applicable):
4.8 - 4.14
How reproducible:
100%
Steps to Reproduce:
1. Install a fresh cluster.
2. Wait ~1hr after the console pods were deployed for the token rotation to occur.
3. Log in to the console and click around.
4. Check the kube-apiserver audit log events for the "authentication.k8s.io/stale-token" annotation.
Actual results:
Many occurrences (I doubt I'll be able to upload a text file, so I'll show a few audit events in the first comment).
Expected results:
The web-console re-reads the SA token regularly so that it never uses an expired token
Additional info:
In a theoretical case where a console pod lasts for a year, it's going to break and won't be able to authenticate to the kube-apiserver. We are planning on disallowing the use of stale tokens in a future release and we need to make sure that the core platform is not broken so that the metrics we collect from the clusters in the wild are not polluted.
This is a clone of issue OCPBUGS-27445. The following is the description of the original issue:
—
Description of problem:
Client side throttling observed when running the metrics controller.
Steps to Reproduce:
1. Install an AWS cluster in mint mode.
2. Enable debug logging by editing cloudcredential/cluster.
3. Wait for the metrics loop to run a few times.
4. Check the CCO logs.
Actual results:
// 7s consumed by metrics loop which is caused by client-side throttling
time="2024-01-20T19:43:56Z" level=info msg="calculating metrics for all CredentialsRequests" controller=metrics
I0120 19:43:56.251278 1 request.go:629] Waited for 176.161298ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.451311 1 request.go:629] Waited for 197.182213ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.651313 1 request.go:629] Waited for 197.171082ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
I0120 19:43:56.850631 1 request.go:629] Waited for 195.251487ms due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/api/v1/namespaces/openshift-cloud-network-config-controller/secrets/cloud-credentials
...
time="2024-01-20T19:44:03Z" level=info msg="reconcile complete" controller=metrics elapsed=7.231061324s
Expected results:
No client-side throttling when running the metrics controller.
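The log above shows the same secret being requested repeatedly within one metrics loop. One possible mitigation, sketched here purely as an assumption and not as the fix that was actually applied, is to memoize reads for the duration of a single loop so that each secret is fetched at most once. The `secretGetter` type and `cachingGetter` function are hypothetical names; the fake getter stands in for a real API client.

```go
package main

import "fmt"

// secretGetter is a hypothetical function type standing in for an API client call.
type secretGetter func(namespace, name string) (string, error)

// cachingGetter wraps a getter and memoizes successful results. It is meant
// to live for one metrics loop only, so stale data is discarded afterwards.
func cachingGetter(get secretGetter) secretGetter {
	cache := map[string]string{}
	return func(namespace, name string) (string, error) {
		key := namespace + "/" + name
		if v, ok := cache[key]; ok {
			return v, nil // served locally, no API request issued
		}
		v, err := get(namespace, name)
		if err != nil {
			return "", err
		}
		cache[key] = v
		return v, nil
	}
}

func main() {
	calls := 0
	// fakeGet stands in for a real API request and counts invocations.
	fakeGet := func(namespace, name string) (string, error) {
		calls++
		return "secret-data", nil
	}

	get := cachingGetter(fakeGet)
	for i := 0; i < 4; i++ {
		if _, err := get("openshift-cloud-network-config-controller", "cloud-credentials"); err != nil {
			panic(err)
		}
	}
	fmt.Println("API requests issued:", calls)
}
```

Other approaches, such as reading from an informer-backed cache or adjusting the client's rate limits, would address the same symptom; which one fits depends on the controller's design.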
MON-2967 and cmo#1890 moved the Observe console menu into a console plugin (in 4.15? 4.14?). Sometimes If-Modified-Since browser caching causes failures that leave the Observe menu missing. When the user eventually finds /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins, the failure is rendered as:
Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/ SyntaxError: Unexpected end of JSON input
This appears to be the result of the browser's If-Modified-Since caching:
$ curl -sH Accept:application/json -H Cache-Control:max-age=0 -H 'Cookie: openshift-session-token=...; login-state=...; ...; csrf-token=...' -H 'If-Modified-Since: Fri, 03 Nov 2023 00:47:45 GMT' -i https://console.build02.ci.openshift.org/api/plugins/monitoring-plugin/plugin-manifest.json
HTTP/1.1 200 OK
date: Tue, 21 Nov 2023 16:52:55 GMT
etag: "65444331-9a2"
last-modified: Fri, 03 Nov 2023 00:47:45 GMT
referrer-policy: strict-origin-when-cross-origin
server: nginx/1.20.1
x-content-type-options: nosniff
x-dns-prefetch-control: off
x-frame-options: DENY
x-xss-protection: 1; mode=block
content-length: 0
While a more recent If-Modified-Since returns populated JSON:
$ curl -sH Accept:application/json -H 'If-Modified-Since: Fri, 10 Nov 2023 10:47:45 GMT' -H 'Cookie: openshift-session-token=...; login-state=...; ...; csrf-token=...' https://console.build02.ci.openshift.org/api/plugins/monitoring-plugin/plugin-manifest.json | jq . | head
{
  "name": "monitoring-plugin",
  "version": "1.0.0",
  "displayName": "OpenShift console monitoring plugin",
  "description": "This plugin adds the monitoring UI to the OpenShift web console",
  "dependencies": {
    "@console/pluginAPI": "*"
  },
  "extensions": [
    {
Disabling caching on the monitoring-plugin side would avoid this issue. But fixing 304 handling in the console's proxy would likely also resolve it.
Seen in 4.15.0-ec.2. Reproduced in ec.2. Failed to reproduce in ec.1. Possibly a regression from ec.1 to ec.2, although I haven't identified a regressing commit yet.
Seen multiple times by multiple users in 4.15.0-ec.2 in two long-lived clusters, and also reproduced in an ec.2 Cluster Bot cluster. Likely consistently reproducible on ec.2.
1. Install a cluster, e.g. with launch 4.15.0-ec.2 gcp.
2. Log into the console and use the developer tab to get an openshift-session-token value from a successful HTTPS request.
3. Run the following with your ${TOKEN} and ${HOST} to confirm 200 responses and find the last-modified value:
$ curl -ksi -H "Cookie: openshift-session-token=${TOKEN}" "https://${HOST}/api/plugins/monitoring-plugin/plugin-manifest.json" | grep 'HTTP\|content-\|last-modified'
4. Run the following with your ${TOKEN}, ${HOST}, and ${LAST_MODIFIED}:
$ curl -ksi -H "If-Modified-Since: ${LAST_MODIFIED}" -H "Cookie: openshift-session-token=${TOKEN}" "https://${HOST}/api/plugins/monitoring-plugin/plugin-manifest.json"
Observe menu is missing, with browser-console logs like:
Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/ SyntaxError: Unexpected end of JSON input
200 responses with no content when If-Modified-Since is greater than or equal to the content's last-modified.
Reliably successful loading of the monitoring console plugin, with a 304 when If-Modified-Since is greater than or equal to the content's last-modified.
Possibly more obvious warnings pointing at /k8s/cluster/operator.openshift.io~v1~Console/cluster/console-plugins when plugins fail to load.
Using the browser's development tools to disable caching while loading the console avoids the problematic caching interaction.
Description of problem:
The Deployment option is missing from the 'Click on the names to access advanced options' list on the Deploy Image page, so the user can no longer set environment variables.
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2023-11-20-205649
How reproducible:
Always
Steps to Reproduce:
1. Log in to OCP, switch to the Developer perspective, and navigate to the Deploy Image page (+Add -> Container image): /deploy-image/ns/default
2. Scroll down and check whether 'Deployment' is listed in the advanced options list.
Actual results:
Deployment is missing from the advanced options list, so the user is no longer able to update the environment variables.
Expected results:
The Deployment option exists.
Additional info:
https://drive.google.com/file/d/1ixQ33DdGzZTAWgzrpp57OqHGFS4v1_3T/view?usp=drive_link
https://drive.google.com/file/d/1dpgFtsr45IovSriwu0RPd0kq0DejRSAm/view?usp=drive_link
Description of problem:
Add flags to hide the Pipeline list pages and details pages from the static plugin, so that the list and details pages from the Pipeline dynamic plugin are shown in the console.
Please review the following PR: https://github.com/openshift/csi-livenessprobe/pull/47
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
Keepalived constantly fails on bootstrap causing installation failure
Seems like it doesn't have keepalived.conf file and keepalived monitor fails on
Version-Release number of selected component (if applicable):
4.13.12
How reproducible:
Regular installation through assisted installer
Steps to Reproduce:
1. 2. 3.
Actual results:
keepalived fails to start
Expected results:
Success
Additional info:
Description of problem:
metal3-baremetal-operator-7ccb58f44b-xlnnd pod failed to start on the SNO baremetal dualstack cluster:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 34m default-scheduler Successfully assigned openshift-machine-api/metal3-baremetal-operator-7ccb58f44b-xlnnd to sno.ecoresno.lab.eng.tlv2.redhat.com
Warning FailedScheduling 34m default-scheduler 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports. preemption: 0/1 nodes are available: 1 node(s) didn't have free ports for the requested pod ports..
Warning FailedCreatePodSandBox 34m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(c4a8b353e3ec105d2bff2eb1670b82a0f226ac1088b739a256deb9dfae6ebe54): cannot open hostport 60000 for pod k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use
Warning FailedCreatePodSandBox 34m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to add hostport mapping for sandbox k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0(9e6960899533109b02fbb569c53d7deffd1ac8185cef3d8677254f9ccf9387ff): cannot open hostport 60000 for pod k8s_metal3-baremetal-operator-7ccb58f44b-xlnnd_openshift-machine-api_5f6d8c69-a508-47f3-a6b1-7701b9d3617e_0_: listen tcp4 :60000: bind: address already in use
Version-Release number of selected component (if applicable):
4.14.0-rc.0
How reproducible:
so far once
Steps to Reproduce:
1. Deploy a disconnected baremetal SNO node with dualstack networking using the agent-based installer.
Actual results:
metal3-baremetal-operator pod fails to start
Expected results:
metal3-baremetal-operator pod is running
Additional info:
Checking the ports on the node showed that a `kube-apiserver` process was bound to the port:
tcp ESTAB 0 0 [::1]:60000 [::1]:2379 users:(("kube-apiserver",pid=43687,fd=455))
After rebooting the node, all pods started as expected.
Please review the following PR: https://github.com/openshift/cluster-openshift-apiserver-operator/pull/548
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Lextudio recently dropped pyasn1, so we want to be explicit and show that we install pysnmp-lextudio but the normal pyasn1.
Description of problem:
Pipeline E2E tests have been disabled because the CI is failing. The probable cause is that our clusters now report 4.15 and the operator couldn't be found because it is only compatible with 4.x-4.14.
Description of problem:
The storage operator degraded because it couldn't locate the node by UUID. I noticed that the providerID was present for node 0, but it was blank for the other nodes. A successful installation can be achieved on day 2 by executing step 4 after step 7 from this document: https://access.redhat.com/solutions/6677901. Additionally, if we provide credentials from the install-config, it's necessary to add the uninitialized taint to the node (oc adm taint node "$NODE" node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule) after the bootstrap has completed.
Version-Release number of selected component (if applicable):
4.15
How reproducible:
100%
Steps to Reproduce:
1. Create an agent ISO image
2. Boot the created ISO on a vSphere VM
Actual results:
Installation fails because the storage operator is unable to find the node by UUID.
Expected results:
Storage operator should be installed without any issue.
Additional info:
Slack discussion: https://redhat-internal.slack.com/archives/C02SPBZ4GPR/p1702893456002729
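The blank providerID described in this report can be spotted from the node list. The following is a minimal sketch, not the operator's own logic: it assumes the input is JSON shaped like the output of `oc get nodes -o json`, and the function name and the sample node names and providerID value are hypothetical.

```python
import json


def nodes_missing_provider_id(node_list_json: str) -> list[str]:
    """Return the names of nodes whose spec.providerID is absent or empty."""
    node_list = json.loads(node_list_json)
    missing = []
    for node in node_list.get("items", []):
        provider_id = node.get("spec", {}).get("providerID", "")
        if not provider_id:
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample shaped like `oc get nodes -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "master-0"}, "spec": {"providerID": "vsphere://example-uuid"}},
        {"metadata": {"name": "master-1"}, "spec": {}},
        {"metadata": {"name": "worker-0"}, "spec": {"providerID": ""}},
    ]
})
print(nodes_missing_provider_id(sample))
```

Any node name this returns is a candidate for the problem described above, since a lookup by UUID needs the providerID to be populated.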
After updating our husky dependency, the pre-commit hook might fail on some systems if their PATH env var is not properly configured:
Running husky pre-commit hook...
frontend/.husky/pre-commit: line 6: lint-staged: command not found
husky - pre-commit hook exited with code 127 (error)
husky - command not found in PATH=<user path>
The PATH env var must include "./node_modules/.bin" for the husky pre-commit hook to work, which should be documented in the README.
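A quick way to confirm the PATH requirement before committing is sketched below. This is an illustrative check only, assuming the entry is listed in its relative form; the helper name `path_has_entry` is hypothetical, and an absolute path to the same directory would not be recognized by this sketch.

```python
import os


def path_has_entry(path_value: str, entry: str = "./node_modules/.bin") -> bool:
    """Report whether a PATH-style string contains the given entry.

    Entries are compared after normalization, so "./node_modules/.bin"
    and "node_modules/.bin" are treated as the same entry.
    """
    wanted = os.path.normpath(entry)
    parts = [p for p in path_value.split(os.pathsep) if p]
    return any(os.path.normpath(p) == wanted for p in parts)


# Check the PATH of the current process.
print(path_has_entry(os.environ.get("PATH", "")))
```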
Issue 33 from https://docs.google.com/spreadsheets/d/1TR3ENY-GE_LQL9F-xH6NahtRHdu6IrYGhLqDDA8y0EI/edit#gid=1035185624
In the left navigation menu of the dev perspective, there is extra space after the divider.
Screenshot: https://drive.google.com/file/d/1ROcHXCLmPPhr30nGTUblMTL-JQqKEsCY/view?usp=drive_link
Description of problem:
The network resource provisioning playbook for 4.15 dualstack UPI contains a task for adding an IPv6 subnet to the existing external router [1]. This task fails with either of the following combinations:
- ansible-2.9.27-1.el8ae.noarch & ansible-collections-openstack-1.8.0-2.20220513065417.5bb8312.el8ost.noarch in an OSP 16 env (RHEL 8.5)
- openstack-ansible-core-2.14.2-4.1.el9ost.x86_64 & ansible-collections-openstack-1.9.1-17.1.20230621074746.0e9a6f2.el9ost.noarch in an OSP 17 env (RHEL 9.2)
Version-Release number of selected component (if applicable):
4.15.0-0.nightly-2024-01-22-160236
How reproducible:
Always
Steps to Reproduce:
1. Set the os_subnet6 in the inventory file for setting dualstack
2. Run the 4.15 network.yaml playbook
Actual results:
Playbook fails:
TASK [Add IPv6 subnet to the external router] **********************************
fatal: [localhost]: FAILED! => {"changed": false, "extra_data": {"data": null, "details": "Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.", "response": "{\"NeutronError\": {\"type\": \"HTTPBadRequest\", \"message\": \"Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}.\", \"detail\": \"\"}}"}, "msg": "Error updating router 8352c9c0-dc39-46ed-94ed-c038f6987cad: Client Error for url: https://10.46.43.81:13696/v2.0/routers/8352c9c0-dc39-46ed-94ed-c038f6987cad, Invalid input for external_gateway_info. Reason: Validation of dictionary's keys failed. Expected keys: {'network_id'} Provided keys: {'external_fixed_ips'}."}
Expected results:
Successful playbook execution
Additional info:
The router can be created in two separate tasks; the playbook [2] worked for me.
[1] https://github.com/openshift/installer/blob/1349161e2bb8606574696bf1e3bc20ae054e60f8/upi/openstack/network.yaml#L43
[2] https://file.rdu.redhat.com/juriarte/upi/network.yaml
Description of problem:
Using the web console on the RH Developer Sandbox, I created the most basic Knative Service (KSVC) using the suggested default image, openshift/hello-openshift. I then tried to change the displayed icon using the web UI, and an error about Probes was displayed. See attached images. The error has no relevance to the item changed.
Version-Release number of selected component (if applicable):
whatever the RH sandbox uses, this value is not displayed to users
How reproducible:
very
Steps to Reproduce:
Using the web console on the RH Developer Sandbox, create the most basic Knative Service (KSVC) using the default image openshift/hello-openshift. Then use the web UI to edit the KSVC sample and change the icon from an OpenShift logo to, for instance, a 3Scale logo. When saving from this form, an error is reported: admission webhook 'validation webhook.serving.knative.dev' denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].readiness.Probe
Actual results:
Expected results:
Either a failure message related to changing the icon, or the icon change to take effect
Additional info:
KSVC details as provided by the web console.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample
  namespace: agroom-dev
spec:
  template:
    spec:
      containers:
        - image: openshift/hello-openshift
Description of problem: Updating the ovn-kubernetes submodules in the windows-machine-config-operator causes nodes to have permission errors setting annotations
E0927 19:37:53.178022 4932 kube.go:130] Error in setting annotation on node ci-op-56c3qr7h-8411c-wdmq9-e2e-wm-xs6sc: admission webhook "node.network-node-identity.openshift.io" denied the request: user "system:node:ci-op-56c3qr7h-8411c-wdmq9-e2e-wm-xs6sc" is not allowed to set the following annotations on node: "ci-op-56c3qr7h-8411c-wdmq9-e2e-wm-xs6sc": [k8s.ovn.org/hybrid-overlay-distributed-router-gateway-mac]
seen in
https://github.com/openshift/windows-machine-config-operator/pull/1836
Description of problem:
GCP e2-custom-* instance types are not supported by our E2E test framework. Now that testplatform has started using those instance types, we are seeing permafailing E2E job runs on our CPMS E2E periodic tests. Error sample:
• [FAILED] [285.539 seconds]
ControlPlaneMachineSet Operator With an active ControlPlaneMachineSet and the instance type is changed [BeforeEach] should perform a rolling update [Periodic]
[BeforeEach] /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/test/e2e/periodic_test.go:39
[It] /go/src/github.com/openshift/cluster-control-plane-machine-set-operator/test/e2e/periodic_test.go:43

[FAILED] provider spec should be updated with bigger instance size
Expected success, but got an error:
<*fmt.wrapError | 0xc000358380>:
failed to get next instance size: instance type did not match expected format: e2-custom-6-16384
{
    msg: "failed to get next instance size: instance type did not match expected format: e2-custom-6-16384",
    err: <*fmt.wrapError | 0xc000358360>{
        msg: "instance type did not match expected format: e2-custom-6-16384",
        err: <*errors.errorString | 0xc0001489f0>{
            s: "instance type did not match expected format",
        },
    },
}
Version-Release number of selected component (if applicable):
How reproducible:
Always
Steps to Reproduce:
1. Use e2-custom in GCP in a cluster, run CPMSO E2E periodics
Actual results:
Permafailing E2Es
Expected results:
Successful E2Es
Additional info:
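To illustrate the kind of parsing the error above points at, here is a minimal sketch that accepts both the predefined `<family>-<class>-<vcpus>` names and the custom `<family>-custom-<vcpus>-<memoryMB>` names such as e2-custom-6-16384. It is not the test framework's implementation: the function names are hypothetical, only three predefined classes are recognized, and doubling vCPUs and memory as the "next size" is an assumption that ignores GCP's per-family limits.

```python
import re
from typing import NamedTuple, Optional


class InstanceType(NamedTuple):
    family: str
    kind: str
    vcpus: int
    memory_mb: Optional[int]  # only set for custom types


_PREDEFINED = re.compile(
    r"^(?P<family>[a-z0-9]+)-(?P<kind>standard|highmem|highcpu)-(?P<vcpus>\d+)$"
)
_CUSTOM = re.compile(
    r"^(?P<family>[a-z0-9]+)-custom-(?P<vcpus>\d+)-(?P<memory>\d+)$"
)


def parse_instance_type(name: str) -> InstanceType:
    """Parse a predefined or custom instance type name."""
    match = _CUSTOM.match(name)
    if match:
        return InstanceType(
            match["family"], "custom", int(match["vcpus"]), int(match["memory"])
        )
    match = _PREDEFINED.match(name)
    if match:
        return InstanceType(match["family"], match["kind"], int(match["vcpus"]), None)
    raise ValueError(f"instance type did not match expected format: {name}")


def next_instance_size(name: str) -> str:
    """Return a larger instance type by doubling vCPUs (and memory if custom).

    Doubling is an illustrative assumption, not a GCP sizing rule.
    """
    parsed = parse_instance_type(name)
    if parsed.memory_mb is None:
        return f"{parsed.family}-{parsed.kind}-{parsed.vcpus * 2}"
    return f"{parsed.family}-custom-{parsed.vcpus * 2}-{parsed.memory_mb * 2}"


print(next_instance_size("e2-custom-6-16384"))
```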
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were not completed when this image was assembled
Description of problem:
Configuring mTLS on the default IngressController breaks the ingress canary check and the console health checks, which in turn puts the ingress and console cluster operators into a degraded state.
OpenShift release version:
OCP-4.9.5
Cluster Platform:
UPI on Baremetal (Disconnected cluster)
How reproducible:
Configure mutual TLS/mTLS using default IngressController as described in the doc(https://docs.openshift.com/container-platform/4.9/networking/ingress-operator.html#nw-mutual-tls-auth_configuring-ingress)
Steps to Reproduce (in detail):
1. Create a config map that is in the openshift-config namespace.
2. Edit the IngressController resource in the openshift-ingress-operator project
3. Add the spec.clientTLS field and subfields to configure mutual TLS:
~~~
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  name: default
  namespace: openshift-ingress-operator
spec:
  clientTLS:
    clientCertificatePolicy: Required
    clientCA:
      name: router-ca-certs-default
    allowedSubjectPatterns:
~~~
Expected results:
mTLS setup should work properly without degrading the Ingress and Console operators.
Impact of the problem:
Unstable cluster with the Ingress and Console operators in a Degraded state.
Additional info:
The following is the Error message for your reference:
The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: CanaryChecksSucceeding=False (CanaryChecksRepetitiveFailures: Canary route checks for the default ingress controller are failing)
// Canary checks looking for required tls certificate.
2021-11-19T17:17:58.237Z ERROR operator.canary_controller wait/wait.go:155 error performing canary route check
// Console operator:
RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.bruce.openshift.local): Get "https://console-openshift-console.apps.bruce.openshift.local": remote error: tls: certificate required
Description of problem:
It was noticed that the openshift-hyperkube RPM, which is primarily, perhaps exclusively, used to install the kubelet in RHCOS or other environments, includes kube-apiserver, kube-controller-manager, and kube-scheduler binaries. Those binaries are all built and used via container images, which as far as I can tell don't make use of the RPM.
Version-Release number of selected component (if applicable):
4.12 - 4.16
How reproducible:
100%
Steps to Reproduce:
1. rpm -ql openshift-hyperkube on any node
Actual results:
# rpm -ql openshift-hyperkube
/usr/bin/hyperkube
/usr/bin/kube-apiserver
/usr/bin/kube-controller-manager
/usr/bin/kube-scheduler
/usr/bin/kubelet
/usr/bin/kubensenter
# ls -lah /usr/bin/kube-apiserver /usr/bin/kube-controller-manager /usr/bin/kube-scheduler /usr/bin/hyperkube /usr/bin/kubensenter /usr/bin/kubelet
-rwxr-xr-x. 2 root root 945 Jan 1 1970 /usr/bin/hyperkube
-rwxr-xr-x. 2 root root 129M Jan 1 1970 /usr/bin/kube-apiserver
-rwxr-xr-x. 2 root root 114M Jan 1 1970 /usr/bin/kube-controller-manager
-rwxr-xr-x. 2 root root 54M Jan 1 1970 /usr/bin/kube-scheduler
-rwxr-xr-x. 2 root root 105M Jan 1 1970 /usr/bin/kubelet
-rwxr-xr-x. 2 root root 3.5K Jan 1 1970 /usr/bin/kubensenter
Expected results:
Just the kubelet and deps on the host OS, that's all that's necessary
Additional info:
My proposed change would be for people that cared about making this slim to install `openshift-hyperkube-kubelet` instead.
Description of problem:
The test case below fails continuously on 4.14 PowerVS, which lowers the success rate of the prod CI.
[bz-Monitoring][invariant] alert/Watchdog must have no gaps or changes
There is no error message apart from the following line, couldn't gather any more related logs
{ Watchdog alert not found}
The https://github.com/pkg/errors repo was archived in December 2021. See also https://github.com/pkg/errors/issues/245.
We should probably use `fmt.Errorf("... %w", err)` instead.
Description of problem:
Reviewing 4.15 install failures shows that a number of variants are impacted by recent install failures.
search.ci: Cluster operator console is not available
Jobs like periodic-ci-openshift-release-master-nightly-4.15-e2e-gcp-sdn-serial show installation failures due to console-operator that appear to start with 4.15.0-0.nightly-2023-12-07-225558:
ConsoleOperator reconciliation failed: Operation cannot be fulfilled on consoles.operator.openshift.io "cluster": the object has been modified; please apply your changes to the latest version and try again
4.15.0-0.nightly-2023-12-07-225558 contains console-operator/pull/814, noting in case it is related
Version-Release number of selected component (if applicable):
4.15
How reproducible:
Steps to Reproduce:
1. Review link to install failures above
Actual results:
Expected results:
Additional info:
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-sdn
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-sdn-upgrade
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade
Description of problem:
Demo dynamic plugin tests are not working when running in dev mode because changes were made in the plugin table structure.
https://issues.redhat.com/browse/RHEL-1671 introduces "dns-changed" event that resolv-prepender should act on. So now instead of a bunch of "-change" and "up" and "whatnot" events we have the one that clearly indicates that the DNS has been changed.
By embedding this into our logic, we will greatly reduce the number of times our scripts are called.
It is important to check when exactly this is going to be shipped so that we synchronize our change with upstream NM.
Add `madhu-pillai` in the `coreos-approvers` and `coreos-reviewers` lists.
Please review the following PR: https://github.com/openshift/cluster-config-operator/pull/353
The PR has been automatically opened by ART (#forum-ocp-art) team automation and indicates
that the image(s) being used downstream for production builds are not consistent
with the images referenced in this component's github repository.
Differences in upstream and downstream builds impact the fidelity of your CI signal.
If you disagree with the content of this PR, please contact @release-artists
in #forum-ocp-art to discuss the discrepancy.
Closing this issue without addressing the difference will cause the issue to
be reopened automatically.
Description of problem:
When creating an ImageDigestMirrorSet with a conflicting mirrorSourcePolicy, no error is reported.
Version-Release number of selected component (if applicable):
% oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.15.0-0.nightly-2024-01-14-100410 True False 27m Cluster version is 4.15.0-0.nightly-2024-01-14-100410
How reproducible:
always
Steps to Reproduce:
1. Create an ImageContentSourcePolicy.
ImageContentSourcePolicy.yaml:
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: ubi8repo
spec:
  repositoryDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi6/ubi-minimal
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example
2. After the mcp finishes updating, check that /etc/containers/registries.conf is updated as expected.
3. Create an ImageDigestMirrorSet with a conflicting mirrorSourcePolicy for the same source "registry.example.com/example".
ImageDigestMirrorSet-conflict.yaml:
apiVersion: config.openshift.io/v1
kind: ImageDigestMirrorSet
metadata:
  name: digest-mirror
spec:
  imageDigestMirrors:
  - mirrors:
    - example.io/example/ubi-minimal
    - example.com/example/ubi-minimal
    source: registry.access.redhat.com/ubi8/ubi-minimal
    mirrorSourcePolicy: AllowContactingSource
  - mirrors:
    - mirror.example.net
    source: registry.example.com/example
    mirrorSourcePolicy: NeverContactSource
Actual results:
3. The ImageDigestMirrorSet is created successfully, but the mcp is not updated and no relevant mc is generated. The machine-config-controller log shows:
I0116 02:34:03.897335 1 container_runtime_config_controller.go:417] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: conflicting mirrorSourcePolicy is set for the same source "registry.example.com/example" in imagedigestmirrorsets and/or imagetagmirrorsets
Expected results:
3. it should prompt: there exist conflicting mirrorSourcePolicy for the same source "registry.example.com/example" in ICSP
Additional info:
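The conflict in this report can be expressed as a small check over the mirror entries. The sketch below is only an illustration of that check, not the machine-config-controller code: the function name is hypothetical, each entry is reduced to a plain dict, and entries without a mirrorSourcePolicy (such as the ICSP ones) are assumed to behave like AllowContactingSource.

```python
from collections import defaultdict


def conflicting_sources(entries: list[dict]) -> list[str]:
    """Return sources that are configured with more than one mirrorSourcePolicy.

    Each entry needs a "source" key; "mirrorSourcePolicy" is optional and is
    assumed to default to AllowContactingSource when missing.
    """
    policies = defaultdict(set)
    for entry in entries:
        policy = entry.get("mirrorSourcePolicy", "AllowContactingSource")
        policies[entry["source"]].add(policy)
    return sorted(source for source, seen in policies.items() if len(seen) > 1)


# Entries taken from the two manifests in the steps to reproduce.
entries = [
    {"source": "registry.access.redhat.com/ubi6/ubi-minimal"},
    {"source": "registry.example.com/example"},
    {"source": "registry.access.redhat.com/ubi8/ubi-minimal",
     "mirrorSourcePolicy": "AllowContactingSource"},
    {"source": "registry.example.com/example",
     "mirrorSourcePolicy": "NeverContactSource"},
]
print(conflicting_sources(entries))
```

Running a check like this when the object is created, rather than when the MachineConfig is rendered, is what would let the error surface at the point the report expects it.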
While debugging a problem, I noticed some containers lack FallbackToLogsOnError. This is important for debugging via the API. Found via https://github.com/openshift/origin/pull/28547
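A minimal sketch of how such containers could be listed is shown below. It assumes the input is a Pod object already loaded as a dict (for example from `oc get pod -o json`); the function name and the sample pod are hypothetical.

```python
def containers_without_fallback(pod: dict) -> list[str]:
    """Return names of containers that do not set
    terminationMessagePolicy: FallbackToLogsOnError."""
    spec = pod.get("spec", {})
    names = []
    for key in ("initContainers", "containers"):
        for container in spec.get(key, []):
            if container.get("terminationMessagePolicy") != "FallbackToLogsOnError":
                names.append(container["name"])
    return names


# Hypothetical pod with one compliant and one non-compliant container.
pod = {
    "spec": {
        "containers": [
            {"name": "operator", "terminationMessagePolicy": "FallbackToLogsOnError"},
            {"name": "sidecar"},
        ]
    }
}
print(containers_without_fallback(pod))
```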
this is case 2 from OCPBUGS-14673
Description of problem:
MHC for control plane does not work correctly in some cases.
Case 2: Stop the kubelet service on the master node. The new master gets Running, the old one is stuck in Deleting, and many cluster operators are degraded.
This is a regression, because when I tested this on 4.12 around September 2022, case 2 and case 3 worked correctly. https://polarion.engineering.redhat.com/polarion/#/project/OSE/workitem?id=OCP-54326
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-06-05-112833 4.13.0-0.nightly-2023-06-06-194351 4.12.0-0.nightly-2023-06-07-005319
How reproducible:
Always
Steps to Reproduce:
1. Create MHC for control plane
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: control-plane-health
  namespace: openshift-machine-api
spec:
  maxUnhealthy: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-type: master
  unhealthyConditions:
  - status: "False"
    timeout: 300s
    type: Ready
  - status: "Unknown"
    timeout: 300s
    type: Ready
liuhuali@Lius-MacBook-Pro huali-test % oc create -f mhc-master3.yaml
machinehealthcheck.machine.openshift.io/control-plane-health created
liuhuali@Lius-MacBook-Pro huali-test % oc get mhc
NAME MAXUNHEALTHY EXPECTEDMACHINES CURRENTHEALTHY
control-plane-health 1 3 3
machine-api-termination-handler 100% 0 0
Case 2. Stop the kubelet service on the master node, new master get Running, the old one stuck in Deleting, many co degraded.
liuhuali@Lius-MacBook-Pro huali-test % oc debug node/huliu-az7c-svq9q-master-1
Starting pod/huliu-az7c-svq9q-master-1-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.0.6
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-5.1# systemctl stop kubelet
Removing debug pod ...
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 Ready control-plane,master 95m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 95m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 19m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 34m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 47m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 83m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Running Standard_D8s_v3 westus 97m
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 97m
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 23m
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 39m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 53m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 91m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 NotReady control-plane,master 107m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 107m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 32m v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 2m10s v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 46m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 59m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 95m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 110m
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 110m
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 36m
huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 5m55s
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 52m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 65m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 103m
liuhuali@Lius-MacBook-Pro huali-test % oc get machine
NAME PHASE TYPE REGION ZONE AGE
huliu-az7c-svq9q-master-1 Deleting Standard_D8s_v3 westus 3h
huliu-az7c-svq9q-master-2 Running Standard_D8s_v3 westus 3h
huliu-az7c-svq9q-master-c96k8-0 Running Standard_D8s_v3 westus 105m
huliu-az7c-svq9q-master-jdhgg-1 Running Standard_D8s_v3 westus 75m
huliu-az7c-svq9q-worker-westus-5r8jf Running Standard_D4s_v3 westus 122m
huliu-az7c-svq9q-worker-westus-k747l Running Standard_D4s_v3 westus 135m
huliu-az7c-svq9q-worker-westus-r2vdn Running Standard_D4s_v3 westus 173m
liuhuali@Lius-MacBook-Pro huali-test % oc get node
NAME STATUS ROLES AGE VERSION
huliu-az7c-svq9q-master-1 NotReady control-plane,master 178m v1.26.5+7a891f0
huliu-az7c-svq9q-master-2 Ready control-plane,master 178m v1.26.5+7a891f0
huliu-az7c-svq9q-master-c96k8-0 Ready control-plane,master 102m v1.26.5+7a891f0
huliu-az7c-svq9q-master-jdhgg-1 Ready control-plane,master 72m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-5r8jf Ready worker 116m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-k747l Ready worker 129m v1.26.5+7a891f0
huliu-az7c-svq9q-worker-westus-r2vdn Ready worker 165m v1.26.5+7a891f0
liuhuali@Lius-MacBook-Pro huali-test % oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-oauth-apiserver ()...
baremetal 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
cloud-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 176m
cloud-credential 4.13.0-0.nightly-2023-06-06-194351 True False False 3h
cluster-autoscaler 4.13.0-0.nightly-2023-06-06-194351 True False False 173m
config-operator 4.13.0-0.nightly-2023-06-06-194351 True False False 175m
console 4.13.0-0.nightly-2023-06-06-194351 True False False 136m
control-plane-machine-set 4.13.0-0.nightly-2023-06-06-194351 True False False 71m
csi-snapshot-controller 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
dns 4.13.0-0.nightly-2023-06-06-194351 True True False 173m DNS "default" reports Progressing=True: "Have 6 available node-resolver pods, want 7."
etcd 4.13.0-0.nightly-2023-06-06-194351 True True True 173m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
image-registry 4.13.0-0.nightly-2023-06-06-194351 True True False 165m Progressing: The registry is ready...
ingress 4.13.0-0.nightly-2023-06-06-194351 True False False 165m
insights 4.13.0-0.nightly-2023-06-06-194351 True False False 168m
kube-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-scheduler 4.13.0-0.nightly-2023-06-06-194351 True False True 171m NodeControllerDegraded: The master nodes not ready: node "huliu-az7c-svq9q-master-1" not ready since 2023-06-07 08:47:34 +0000 UTC because NodeStatusUnknown (Kubelet stopped posting node status.)
kube-storage-version-migrator 4.13.0-0.nightly-2023-06-06-194351 True False False 106m
machine-api 4.13.0-0.nightly-2023-06-06-194351 True False False 167m
machine-approver 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
machine-config 4.13.0-0.nightly-2023-06-06-194351 False False True 60m Cluster not available for [{operator 4.13.0-0.nightly-2023-06-06-194351}]: failed to apply machine config daemon manifests: error during waitForDaemonsetRollout: [timed out waiting for the condition, daemonset machine-config-daemon is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)]
marketplace 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
monitoring 4.13.0-0.nightly-2023-06-06-194351 True False False 106m
network 4.13.0-0.nightly-2023-06-06-194351 True True False 177m DaemonSet "/openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)...
node-tuning 4.13.0-0.nightly-2023-06-06-194351 True False False 173m
openshift-apiserver 4.13.0-0.nightly-2023-06-06-194351 True True True 107m APIServerDeploymentDegraded: 1 of 4 requested instances are unavailable for apiserver.openshift-apiserver ()
openshift-controller-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 170m
openshift-samples 4.13.0-0.nightly-2023-06-06-194351 True False False 167m
operator-lifecycle-manager 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
operator-lifecycle-manager-catalog 4.13.0-0.nightly-2023-06-06-194351 True False False 174m
operator-lifecycle-manager-packageserver 4.13.0-0.nightly-2023-06-06-194351 True False False 168m
service-ca 4.13.0-0.nightly-2023-06-06-194351 True False False 175m
storage 4.13.0-0.nightly-2023-06-06-194351 True True False 174m AzureDiskCSIDriverOperatorCRProgressing: AzureDiskDriverNodeServiceControllerProgressing: Waiting for DaemonSet to deploy node pods...
liuhuali@Lius-MacBook-Pro huali-test %
-----------------------
There might be an easier way by just rolling a revision in etcd, stopping kubelet and then observing the same issue.
Actual results:
CEO's member removal controller is getting stuck on the IsBootstrapComplete check that was introduced to fix another bug: https://github.com/openshift/cluster-etcd-operator/commit/c96150992a8aba3654835787be92188e947f557c#diff-d91047e39d2c1ab6b35e69359a24e83c19ad9b3e9ad4e44f9b1ac90e50f7b650R97
It turns out that IsBootstrapComplete checks whether a revision is currently rolling out (which makes sense), and the one NotReady node with kubelet gone still has a revision going (rev 7, target 9). More info: https://issues.redhat.com/browse/OCPBUGS-14673?focusedId=22726712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-22726712
This causes the etcd member to not be removed, which in turn blocks the vertical scale-down procedure from removing the pre-drain hook because the member is still present. Effectively, you end up with a cluster of 4 CP machines, where one is stuck in the Deleting state.
Expected results:
The etcd member should be removed and the machine/node should be deleted
Additional info:
Removing the revision check does fix this issue reliably, but might not be desirable: https://github.com/openshift/cluster-etcd-operator/pull/1087
Document URL:
[1] https://docs.openshift.com/container-platform/4.15/installing/installing_aws/installing-aws-account.html#installation-aws-permissions_installing-aws-account
Section Number and Name:
* Required EC2 permissions for installation
Description of problem:
The permission ec2:DisassociateAddress is required for OCP 4.16+ installs, but it is missing from the official doc [1]. We would like to understand why, or whether, this permission is necessary.
level=info msg=Destroying the bootstrap resources...
...
level=error msg=Error: disassociating EC2 EIP (eipassoc-01e8cc3f06f2c2499): UnauthorizedOperation: You are not authorized to perform this operation. User: arn:aws:iam::301721915996:user/ci-op-0xjvtwb0-4e979-minimal-perm is not authorized to perform: ec2:DisassociateAddress on resource: arn:aws:ec2:us-east-1:301721915996:elastic-ip/eipalloc-0274201623d8569af because no identity-based policy allows the ec2:DisassociateAddress action.
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-03-13-061822
How reproducible:
Always
Steps to Reproduce:
1. Create an OCP cluster with the permissions listed in the official doc.
Actual results:
See description.
Expected results:
Cluster is created successfully.
Suggestions for improvement:
Add ec2:DisassociateAddress to `Required EC2 permissions for installation` in [1]
Additional info:
This impacts the permission list in ROSA Installer-Role as well.
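A gap like this can be caught before an install by comparing a policy document against the actions the installer needs. The sketch below is a simplified illustration under stated assumptions: the function name is hypothetical, only `Allow` statements with an `Action` key are considered, and `Deny`, `NotAction`, `Resource`, and `Condition` are ignored, so it is not a substitute for real IAM evaluation.

```python
import fnmatch


def missing_actions(policy: dict, required: list[str]) -> list[str]:
    """Return the required actions that no Allow statement in the policy covers.

    Action patterns are matched case-insensitively and may use wildcards,
    e.g. "ec2:*".
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    allowed = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        allowed.extend(a.lower() for a in actions)
    return [
        action
        for action in required
        if not any(fnmatch.fnmatchcase(action.lower(), pattern) for pattern in allowed)
    ]


# Hypothetical minimal policy that predates the new requirement.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:AllocateAddress", "ec2:AssociateAddress"],
         "Resource": "*"}
    ],
}
print(missing_actions(policy, ["ec2:AssociateAddress", "ec2:DisassociateAddress"]))
```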
Description of problem:
When installing an IPI cluster with confidential VMs, the installer should pre-check the VM type, disk encryption type, etc. to avoid installation failures during infrastructure creation.
1. VM type
Different VM types support different security types. For example, if platform.azure.defaultMachinePlatform.type is set to Standard_DC8ads_v5 and platform.azure.defaultMachinePlatform.settings.securityType is set to TrustedLaunch, installation fails because Standard_DC8ads_v5 only supports security type ConfidentialVM:
ERROR Error: creating Linux Virtual Machine: (Name "jimaconf1-89qmp-bootstrap" / Resource Group "jimaconf1-89qmp-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The VM size 'Standard_DC16ads_v5' is not supported for creation of VMs and Virtual Machine Scale Set with 'TrustedLaunch' security type."
2. Disk Encryption Set
When installing a cluster with ConfidentialVM + securityEncryptionType: DiskWithVMGuestState and a customer-managed key, the DES encryption type must be ConfidentialVmEncryptedWithCustomerKey; otherwise the installer throws the error below:
08-31 10:12:54.443 level=error msg=Error: creating Linux Virtual Machine: (Name "jima30confa-vtrm2-bootstrap" / Resource Group "jima30confa-vtrm2-rg"): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code="BadRequest" Message="The type of the Disk Encryption Set in the request is 'ConfidentialVmEncryptedWithCustomerKey', but this Disk Encryption Set was created with type 'EncryptionAtRestWithCustomerKey'." Target="/subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/jima30confa-vtrm2-rg/providers/Microsoft.Compute/disks/jima30confa-vtrm2-bootstrap_OSDisk"
The installer should check the VM type and the DES's encryption type to make sure that the expected DES is set.
Version-Release number of selected component (if applicable):
4.14 nightly build
How reproducible:
Always
Steps to Reproduce:
1. Prepare install-config:
1) enable confidentialVM but use a VM type which does not support Confidential VM
2) enable TrustedLaunch but use a VM type which supports confidentialVM
3) enable confidentialVM + securityEncryptionType: DiskWithVMGuestState, use customer-managed key to encrypt managed key, but customer-managed key's encryption type is the default one "EncryptionAtRestWithPlatformKey"
2. Create cluster
Actual results:
Installation failed when creating infrastructure
Expected results:
The installer should pre-check those scenarios and exit with the expected error message.
Additional info:
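The two pre-checks requested above could be sketched roughly as follows. This is a minimal illustration, not the installer's actual validation code: the function name precheck, the lookup table, and the size name Example_TrustedLaunch_Only_Size are hypothetical, and only the Standard_DC8ads_v5 entry and the required DES type come from the report.

```python
from typing import List, Optional

# Illustrative table only. The Standard_DC8ads_v5 entry reflects the report;
# the second entry is a hypothetical placeholder, not a real Azure size.
SUPPORTED_SECURITY_TYPES = {
    "Standard_DC8ads_v5": {"ConfidentialVM"},
    "Example_TrustedLaunch_Only_Size": {"TrustedLaunch"},
}

# DES type required for ConfidentialVM + DiskWithVMGuestState with a
# customer-managed key, as stated in the report.
REQUIRED_DES_TYPE = "ConfidentialVmEncryptedWithCustomerKey"


def precheck(
    vm_size: str,
    security_type: str,
    security_encryption_type: Optional[str] = None,
    des_encryption_type: Optional[str] = None,
) -> List[str]:
    """Return a list of human-readable validation errors (empty if none)."""
    errors = []

    # Check 1: the VM size must support the requested security type.
    supported = SUPPORTED_SECURITY_TYPES.get(vm_size)
    if supported is not None and security_type not in supported:
        errors.append(
            f"VM size {vm_size} does not support security type {security_type}"
        )

    # Check 2: the DES encryption type must match the confidential VM setup.
    if (
        security_type == "ConfidentialVM"
        and security_encryption_type == "DiskWithVMGuestState"
        and des_encryption_type is not None
        and des_encryption_type != REQUIRED_DES_TYPE
    ):
        errors.append(
            f"disk encryption set has type {des_encryption_type}, "
            f"expected {REQUIRED_DES_TYPE}"
        )
    return errors
```

Returning all errors at once, rather than failing on the first, would let the installer report every misconfiguration before any infrastructure is created.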
Kube 1.26 introduced the warning level TopologyAwareHintsDisabled event. TopologyAwareHintsDisabled is fired by the EndpointSliceController whenever reconciling a service that has activated topology aware hints via the service.kubernetes.io/topology-aware-hints annotation, but there is not enough information in the existing cluster resources (typically nodes) to apply the topology aware hints.
When re-basing OpenShift onto Kube 1.26, our CI builds are failing (except on AWS), because these events are firing "pathologically", for example:
: [sig-arch] events should not repeat pathologically
events happened too frequently event happened 83 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Insufficient Node information: allocatable CPU or zone not specified on one or more nodes, addressType: IPv4 result=reject
AWS nodes seem to have the proper values in the nodes. GCP has the values also, but they are not "right" for the purposes of the EndpointSliceController:
event happened 38 times, something is wrong: ns/openshift-dns service/dns-default - reason/TopologyAwareHintsDisabled Unable to allocate minimum required endpoints to each zone without exceeding overload threshold (5 endpoints, 3 zones), addressType: IPv4 result=reject
https://github.com/openshift/origin/pull/27666 will mask this problem (make it stop erroring in CI) but changes still need to be made in the product so end users are not subjected to these events.
Now links to:
[sig-arch] events should not repeat pathologically for ns/openshift-dns
Description of problem:
The bootstrap process failed because API_URL and API_INT_URL are not resolvable:
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Failed with result 'exit-code'.
Feb 06 06:41:49 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 1.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Stopped Bootstrap a Kubernetes cluster.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: bootkube.service: Consumed 1min 457ms CPU time.
Feb 06 06:41:54 yunjiang-dn16d-657jf-bootstrap systemd[1]: Started Bootstrap a Kubernetes cluster.
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Check if API and API-Int URLs are resolvable during bootstrap
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_URL api.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Checking if api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com of type API_INT_URL is resolvable
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting stage resolve-api-int-url
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Unable to resolve API_INT_URL api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8905]: https://localhost:2379 is healthy: successfully committed proposal: took = 7.880477ms
Feb 06 06:41:58 yunjiang-dn16d-657jf-bootstrap bootkube.sh[7781]: Starting cluster-bootstrap...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Starting temporary bootstrap control plane...
Feb 06 06:41:59 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: Waiting up to 20m0s for the Kubernetes API
Feb 06 06:42:00 yunjiang-dn16d-657jf-bootstrap bootkube.sh[8989]: API is up
install logs:
...
time="2024-02-06T06:54:28Z" level=debug msg="Unable to connect to the server: dial tcp: lookup api-int.yunjiang-dn16d.qe.gcp.devcluster.openshift.com on 169.254.169.254:53: no such host"
time="2024-02-06T06:54:28Z" level=debug msg="Log bundle written to /var/home/core/log-bundle-20240206065419.tar.gz"
time="2024-02-06T06:54:29Z" level=error msg="Bootstrap failed to complete: timed out waiting for the condition"
time="2024-02-06T06:54:29Z" level=error msg="Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane."
...
Version-Release number of selected component (if applicable):
4.16.0-0.nightly-2024-02-05-184957, openshift/machine-config-operator#4165
How reproducible:
Always.
Steps to Reproduce:
1. Enable custom DNS on GCP: platform.gcp.userProvisionedDNS:Enabled and featureSet:TechPreviewNoUpgrade
2. Create cluster
Actual results:
Failed to complete bootstrap process.
Expected results:
See description.
Additional info:
I believe 4.15 will be affected as well once https://github.com/openshift/machine-config-operator/pull/4165 is backported to 4.15. Currently, 4.15 fails at an earlier phase; see https://issues.redhat.com/browse/OCPBUGS-28969
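The kind of resolvability check that bootkube.sh logs for API_URL and API_INT_URL could be approximated as below. This is only a sketch of the idea, not the script's actual logic: the function name unresolvable_hosts and the injectable resolver parameter are assumptions made for illustration.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hosts: Iterable[str],
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the subset of hosts that the resolver cannot resolve."""
    failed = []
    for host in hosts:
        try:
            # getaddrinfo(host, port) raises socket.gaierror (an OSError
            # subclass) when the name cannot be resolved.
            resolver(host, None)
        except OSError:
            failed.append(host)
    return failed
```

With user-provisioned DNS, a caller would typically treat a non-empty result as a reason to wait and retry rather than to abort, since the records may be created after the bootstrap node starts.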
When collecting onprem events, we want to be able to distinguish among the various onprem deployments:
We should also make sure this information is forwarded when collecting events
We should also define a human-friendly version for each
Slack thread about the supported deployment types
https://redhat-internal.slack.com/archives/CUPJTHQ5P/p1706209886659329
Environment variables for setting the deployment type (one of: podman, operator, ACM, MCE, ABI) and for setting the release version (if applicable)
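Reading and validating these settings might look like the sketch below. The variable names DEPLOYMENT_TYPE and DEPLOYMENT_VERSION and the function read_deployment_info are hypothetical, chosen only for illustration; the list of allowed types comes from the item above.

```python
import os
from typing import Mapping, Optional

# Allowed deployment types, taken from the list above.
ALLOWED_DEPLOYMENT_TYPES = ("podman", "operator", "ACM", "MCE", "ABI")


def read_deployment_info(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read deployment type and optional release version from the environment.

    DEPLOYMENT_TYPE and DEPLOYMENT_VERSION are hypothetical variable names.
    """
    env = os.environ if env is None else env
    deployment_type = env.get("DEPLOYMENT_TYPE", "")
    if deployment_type not in ALLOWED_DEPLOYMENT_TYPES:
        raise ValueError(
            f"unsupported deployment type {deployment_type!r}; "
            f"expected one of {ALLOWED_DEPLOYMENT_TYPES}"
        )
    # The release version is optional ("if applicable"); empty means unset.
    release_version = env.get("DEPLOYMENT_VERSION") or None
    return {"deployment_type": deployment_type, "release_version": release_version}
```

The returned dictionary could then be attached to each collected event so that on-prem deployments can be told apart downstream.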
Description of the problem:
Installation of a 3+1 cluster fails to complete when 1 worker is in Error
CVO cannot complete installation
1/22/2023, 10:00:33 PM Operator cvo status: progressing message: Unable to apply 4.12.0: the cluster operator machine-api is not available
1/22/2023, 9:56:33 PM Operator cvo status: progressing message: Unable to apply 4.12.0: some cluster operators are not available
1/22/2023, 9:56:33 PM Operator console status: available message: All is well
1/22/2023, 9:55:13 PM Updated status of the cluster to finalizing
[Detected by regression test: test_delete_host_during_installation_success]
How reproducible:
100%, started a while ago.
Steps to reproduce:
1. Start a 3+1 installation
2. Once the worker starts its installation, kill the worker's agent
3. The worker goes to Error and the installation continues
Actual results:
CVO fails to install, eventually timing out, and the cluster ends with a failure
Expected results:
Cluster completes installation with 1 failed worker
When using an autoscaling MachinePool with OpenStack, setting minReplicas=0 results in a nil pointer panic.
See HIVE-2415 for context.
As a developer of CPMS I want to ensure unhealthy nodes can be replaced so that we can recommend to users to use CPMS
QE have some manual test cases that test a couple of unhappy scenarios for the CPMS, that should result in automatic recovery.
I would like to see these automated as part of the periodic suite for CPMS.
The behaviour itself isn't really dependent on CPMS, but, the whole workflow is.
The behaviour is primarily based on other components and how they react, but block CPMS from operating as expected.
The two cases I would like to see added are:
Description of the problem:
The InfraEnv resource will accept both arm64 and aarch64 as valid cpuArchitectures. Both result in an ISO URL with arm64 in the path. However, supplying the InfraEnv with cpuArchitecture arm64 will result in the converged flow becoming stuck, because the metal3 PreprovisioningImage resource only accepts aarch64 as an architecture:
- lastTransitionTime: "2023-10-26T14:46:14Z"
  message: PreprovisioningImage CPU architecture (aarch64) does not match InfraEnv CPU architecture (arm64)
  observedGeneration: 2
  reason: InfraEnvArchMismatch
  status: "False"
  type: Ready
- lastTransitionTime: "2023-10-26T14:46:14Z"
  message: PreprovisioningImage CPU architecture (aarch64) does not match InfraEnv CPU architecture (arm64)
  observedGeneration: 2
  reason: InfraEnvArchMismatch
  status: "True"
  type: Error
networkData: {}
How reproducible:
100%
Steps to reproduce:
1. Create an infraenv with cpuArchitecture: arm64
2. Create BMH resources with the converged flow enabled
Actual results:
PreprovisioningImages have InfraEnvArchMismatch because they only support the aarch64 architecture
Expected results:
InfraEnv either only supports aarch64 as cpuArchitecture or correctly translates arm64 to aarch64.
Workaround
The workaround is to create the InfraEnv resource with cpuArchitecture: aarch64 instead of arm64
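The "correctly translates arm64 to aarch64" option could be as small as an alias lookup applied before the two architectures are compared. The sketch below only illustrates that idea; the names ARCH_ALIASES and normalize_cpu_architecture are hypothetical and not taken from assisted-service or metal3.

```python
# Map alternative spellings to the canonical architecture name.
# Only the arm64 -> aarch64 alias comes from the report above.
ARCH_ALIASES = {"arm64": "aarch64"}


def normalize_cpu_architecture(arch: str) -> str:
    """Return the canonical architecture name, leaving unknown values as-is."""
    return ARCH_ALIASES.get(arch, arch)


def architectures_match(infra_env_arch: str, image_arch: str) -> bool:
    """Compare two architecture strings after normalizing both sides."""
    return normalize_cpu_architecture(infra_env_arch) == normalize_cpu_architecture(image_arch)
```

Normalizing both sides of the comparison means an InfraEnv created with arm64 would no longer trip an InfraEnvArchMismatch-style check against an aarch64 image.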
I took a look at Component Readiness today and noticed that "[sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time" is permafailing. I modified the sample start time and it appears to have started around February 19th.
Is this expected with 4.16 or do we have a problem?
Component Readiness has found a potential regression in [sig-cluster-lifecycle] cluster upgrade should complete in a reasonable time.
Probability of significant regression: 100.00%
Sample (being evaluated) Release: 4.16
Start Time: 2024-02-27T00:00:00Z
End Time: 2024-03-04T23:59:59Z
Success Rate: 0.00%
Successes: 0
Failures: 4
Flakes: 0
Base (historical) Release: 4.15
Start Time: 2024-02-01T00:00:00Z
End Time: 2024-02-28T23:59:59Z
Success Rate: 100.00%
Successes: 47
Failures: 0
Flakes: 0
David mentioned this issue here: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1702312628947029
In duplicated_event_patterns, I think it creates a blackout range (events are OK during time X) and then checks the time range itself, but it doesn't appear to exclude the change in counts.
The count of the last event within the allowed range should be subtracted from the count of the first event that is outside of the allowed time range for the pathological event test calculation.
David has a demonstration of the count here: https://github.com/openshift/origin/pull/28456 but to fix it you have to invert testDuplicatedEvents to iterate through the event registry, not the events
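The count adjustment described above can be illustrated with a small sketch. It assumes each observation of a repeating event carries a timestamp and a cumulative count, and it ignores any increase in the count that was observed inside the allowed range. The function name and data shape are hypothetical and are not taken from openshift/origin.

```python
from typing import List, Tuple


def count_outside_allowed_range(
    observations: List[Tuple[int, int]],
    allowed_start: int,
    allowed_end: int,
) -> int:
    """Sum the count increases observed outside [allowed_start, allowed_end].

    observations is a list of (timestamp, cumulative_count) pairs sorted by
    timestamp. Increases observed inside the allowed range are skipped, which
    has the effect of subtracting the last in-range count from the first
    out-of-range count that follows it.
    """
    total = 0
    previous_count = 0
    for timestamp, cumulative_count in observations:
        increase = cumulative_count - previous_count
        if not (allowed_start <= timestamp <= allowed_end):
            total += increase
        previous_count = cumulative_count
    return total
```

For example, with observations [(10, 5), (20, 30), (30, 40), (40, 45)] and an allowed range of 15 to 35, only the increases at timestamps 10 and 40 are counted, so the event is treated as having happened 10 times rather than 45.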
OKD's sample operator is using a different set of images; specifically, for mysql it's importing them from quay.io.
So the "Only known images used by tests" test from the e2e suite frequently fails.
registry.redhat.io/rhel8/mysql-80:latest from pods: ns/e2e-test-oc-builds-57lj7 pod/database-1-9zdgz node/ip-10-0-95-30.ec2.internal
Along with disruption monitoring via external endpoint we should add in-cluster monitors which run the same checks over:
These tests should be implemented as deployments with anti-affinity, landing on different nodes. Deployments are chosen so that the nodes can be drained properly. These deployments write to the host disk, and on restart the pod will pick up the existing data. When a special configmap is created, the pod will stop collecting disruption data.
The external part of the test will create the deployments (and necessary RBAC objects) when the test is started, create the stop configmap when it ends, and collect data from the nodes. The test will expose them on the intervals chart, so that the data can be used to find the source of disruption
primary_ipv4_address is deprecated in favor of primary_ip[*].address. Replace it with the new attribute.
We are currently inheriting labels:
skopeo inspect -n docker://quay.io/redhat-user-workloads/crt-redhat-acm-tenant/hypershift-operator/hypershift-operator-main@sha256:a2e9ad049c260409cb09f82396be70d60efa4ed579ac8f95cb304332b8a9920a | jq -e ".Labels"
{
  "architecture": "x86_64",
  "build-date": "2023-09-21T19:24:45",
  "com.redhat.component": "ubi9-minimal-container",
  "com.redhat.license_terms": "https://www.redhat.com/en/about/red-hat-end-user-license-agreements#UBI",
  "description": "The Universal Base Image Minimal is a stripped down image that uses microdnf as a package manager. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
  "distribution-scope": "public",
  "io.buildah.version": "1.31.0",
  "io.k8s.description": "The Universal Base Image Minimal is a stripped down image that uses microdnf as a package manager. This base image is freely redistributable, but Red Hat only supports Red Hat technologies through subscriptions for Red Hat products. This image is maintained by Red Hat and updated regularly.",
  "io.k8s.display-name": "Red Hat Universal Base Image 9 Minimal",
  "io.openshift.expose-services": "",
  "io.openshift.hypershift.control-plane-operator-applies-management-kas-network-policy-label": "true",
  "io.openshift.hypershift.control-plane-operator-creates-aws-sg": "true",
  "io.openshift.hypershift.control-plane-operator-manages-ignition-server": "true",
  "io.openshift.hypershift.control-plane-operator-manages.cluster-autoscaler": "true",
  "io.openshift.hypershift.control-plane-operator-manages.cluster-machine-approver": "true",
  "io.openshift.hypershift.control-plane-operator-manages.decompress-decode-config": "true",
  "io.openshift.hypershift.control-plane-operator-skips-haproxy": "true",
  "io.openshift.hypershift.control-plane-operator-subcommands": "true",
  "io.openshift.hypershift.ignition-server-healthz-handler": "true",
  "io.openshift.hypershift.restricted-psa": "true",
  "io.openshift.tags": "minimal rhel9",
  "maintainer": "Red Hat, Inc.",
  "name": "ubi9-minimal",
  "release": "750",
  "summary": "Provides the latest release of the minimal Red Hat Universal Base Image 9.",
  "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/ubi9-minimal/images/9.2-750",
  "vcs-ref": "7ef59505f75bf0c11c8d3addefebee5ceaaf4c41",
  "vcs-type": "git",
  "vendor": "Red Hat, Inc.",
  "version": "9.2"
}
Thus, we need to set:
Description of problem:
All Projects' dropdown test is failing in CI
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
The Service CA operator creates certificates and secrets, and injects cert info into configmaps that request it via annotation.
Those secrets and configmaps need to have ownership and description annotations to support cert ownership validation.