Jump to: Complete Features | Incomplete Features | Complete Epics | Other Complete
Note: this page shows the Feature-Based Change Log for a release
These features were completed when this image was assembled
On new installations, we should make the StorageClass created by the CSI operator the default one.
However, we shouldn't do that in an upgrade scenario. The main reason is that users might have set a different quota on the CSI driver StorageClass.
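As an illustration, a minimal sketch (Go with client-go, under stated assumptions) of that install-time behavior: it uses the standard storageclass.kubernetes.io/is-default-class annotation, backs off whenever a default already exists (the upgrade case), and the class name "gp3-csi" plus the fresh-install heuristic are hypothetical.

```go
// Hypothetical sketch: mark the CSI operator's StorageClass as the cluster
// default only when no default exists yet, never on upgraded clusters.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// If any StorageClass is already the default (e.g. one the admin tuned
	// before upgrading), leave it alone.
	classes, err := client.StorageV1().StorageClasses().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, sc := range classes.Items {
		if sc.Annotations["storageclass.kubernetes.io/is-default-class"] == "true" {
			fmt.Println("a default StorageClass already exists; not changing it")
			return
		}
	}

	// The annotation below is the well-known Kubernetes marker for the
	// default StorageClass; "gp3-csi" is a hypothetical class name.
	patch := []byte(`{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}`)
	if _, err := client.StorageV1().StorageClasses().Patch(
		context.TODO(), "gp3-csi", types.MergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
}
```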
Exit criteria:
Rebase openshift-controller-manager to k8s 1.24
Assumption
Doc: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
Customers do not pay Red Hat more to run HyperShift control planes and supporting infrastructure than Standalone control planes and supporting infrastructure.
Assumption
Run cluster-storage-operator (CSO) + AWS EBS CSI driver operator + AWS EBS CSI driver control-plane Pods in the management cluster; run the driver DaemonSet in the hosted cluster.
More information here: https://docs.google.com/document/d/1sXCaRt3PE0iFmq7ei0Yb1svqzY9bygR5IprjgioRkjc/edit
As a HyperShift Cluster Instance Admin, I want to run the AWS EBS CSI driver operator and the control plane of the CSI driver in the management cluster, so that the guest cluster runs just my applications.
Exit criteria:
Enable sharing ConfigMap and Secret across namespaces
Requirement | Notes | isMvp? |
---|---|---|
Secrets and ConfigMaps can be shared across namespaces | YES |
Consumption of RHEL entitlements has been a challenge on OCP 4 since it moved to a cluster-based entitlement model, compared to the node-based (RHEL subscription manager) entitlement model of OCP 3. To provide a sufficiently similar experience to OCP 3, the entitlement certificates that are made available on the cluster (OCPBU-93) should be shared across namespaces. This prevents cluster admins from having to copy these entitlements into each namespace, which creates additional operational challenges for updating and refreshing them.
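For illustration only, a hedged sketch of how a workload could consume such a shared Secret. It assumes the Shared Resource CSI driver this effort is built on; the driver name csi.sharedresource.openshift.io and the sharedSecret volume attribute follow the upstream Shared Resources design and should be treated as assumptions here.

```go
// Sketch, assuming the Shared Resource CSI driver: a pod volume that mounts a
// cluster-scoped SharedSecret (e.g. the RHEL entitlement certs) into a
// consuming namespace, so admins no longer copy the Secret by hand.
package sharedresource

import corev1 "k8s.io/api/core/v1"

func entitlementVolume(sharedSecretName string) corev1.Volume {
	readOnly := true
	return corev1.Volume{
		Name: "etc-pki-entitlement",
		VolumeSource: corev1.VolumeSource{
			CSI: &corev1.CSIVolumeSource{
				// Driver name and attribute key are assumptions based on
				// the upstream Shared Resources design.
				Driver:   "csi.sharedresource.openshift.io",
				ReadOnly: &readOnly,
				VolumeAttributes: map[string]string{
					"sharedSecret": sharedSecretName,
				},
			},
		},
	}
}
```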
Questions to be addressed:
* What educational or reference material (docs) is required to support this product feature? For users/admins? Other functions (security officers, etc)?
* Does this feature have doc impact?
* New Content, Updates to existing content, Release Note, or No Doc Impact
* If unsure and no Technical Writer is available, please contact Content Strategy.
* What concepts do customers need to understand to be successful in [action]?
* How do we expect customers will use the feature? For what purpose(s)?
* What reference material might a customer want/need to complete [action]?
* Is there source material that can be used as reference for the Technical Writer in writing the content? If yes, please link if available.
* What is the doc impact (New Content, Updates to existing content, or Release Note)?
As an OpenShift cluster admin
I want to use the TechPreview feature set to try out CSI volumes in builds
So that I can test using CSI volumes in builds through tech preview features
Product documentation will not be required until BUILD-275 is complete. Documentation for CSI volumes in builds will need to note that the TechPreviewNoUpgrade feature set needs to be enabled on the cluster.
Additional training enablement materials may not be needed - product docs should be sufficient.
Full e2e testing may not be feasible until BUILD-275 is completed.
CI testing should verify that the appropriate configuration values were passed to the build controller.
We will likely need a new CI job that installs the cluster with tech preview enabled before we verify that the BuildCSIVolumes feature gate has been enabled.
OpenShift already has feature gates baked into the core platform via the FeatureGate API object. For this feature, we need to declare a feature gate that is added to the TechPreviewNoUpgrade feature set, which openshift-controller-manager-operator then reads and applies to the build controller.
Feature gate needs to be proposed to openshift/api (add to the TechPreviewNoUpgrade feature set).
An example PR on how to do this: https://github.com/openshift/api/pull/982.
Once approved, the updated tech preview feature set needs to be vendored into openshift/library-go.
openshift-controller-manager-operator needs to read the feature gate and pass it on to the build controller.
The build controller has its own configuration "API" - this was a relic of the 3.x master configuration that is not exposed to admins in OCP 4.x: https://github.com/openshift/api/blob/master/openshiftcontrolplane/v1/types.go#L198-L207
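A hedged sketch of the operator-side check, assuming the gate rides on the TechPreviewNoUpgrade feature set as described above; the gate name BuildCSIVolumes is taken from this epic, and translating the result into the build controller's configuration "API" is left out.

```go
// Sketch: openshift-controller-manager-operator-style read of the cluster
// FeatureGate singleton to decide whether BuildCSIVolumes should be on.
package featuregates

import (
	"context"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

func buildCSIVolumesEnabled(cfg *rest.Config) (bool, error) {
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		return false, err
	}
	// The cluster-wide FeatureGate object is a singleton named "cluster".
	fg, err := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	// Simplification: TechPreviewNoUpgrade implies every tech-preview gate,
	// including (by assumption) BuildCSIVolumes. A real operator would
	// resolve the individual gate rather than the whole set.
	return fg.Spec.FeatureSet == configv1.TechPreviewNoUpgrade, nil
}
```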
A separate operator checks whether a *NoUpgrade feature set is enabled, and if so marks the cluster as unable to be upgraded to the next minor OCP release.
To test this in CI, we need a suite that runs with the TechPreviewNoUpgrade feature set enabled. The step registry has primitives which bring up a cluster with tech preview features enabled. We will need to update ocm-o's CI configuration to run our operator tests with tech preview enabled. Testing for this specific feature will need to have separate logic that verifies we are sending the right configuration to the build controller under normal and TechPreview mode.
Existing techpreview CI step registry setups (note the per-cloud elements, which make sense, since the existing CSI drivers are per cloud):
./ci-operator/step-registry/ipi/aws/pre/techpreview
./ci-operator/step-registry/ipi/azure/pre/techpreview
./ci-operator/step-registry/ipi/conf/aws/techpreview
./ci-operator/step-registry/ipi/conf/azure/techpreview
./ci-operator/step-registry/ipi/conf/techpreview
./ci-operator/step-registry/ipi/conf/openstack/techpreview
./ci-operator/step-registry/ipi/openstack/pre/techpreview
./ci-operator/step-registry/openshift/e2e/aws/techpreview
./ci-operator/step-registry/openshift/e2e/gcp/techpreview
./ci-operator/step-registry/openshift/e2e/azure/techpreview
./ci-operator/step-registry/openshift/e2e/openstack/techpreview
./ci-operator/step-registry/openshift/e2e/vsphere/techpreview
Given that shared resources span all clouds, does that mean we touch each of these, create a new one, or both?
When this image was assembled, these features were not yet completed. Therefore, only the Jira Cards included here are part of this release
Goal:
As an administrator, I would like to deploy OpenShift 4 clusters to AWS C2S region
Problem:
Customers were able to deploy to the AWS C2S region in OCP 3.11, but our global configuration in OCP 4.1 doesn't support this.
Why is this important:
Lifecycle Information:
Previous Work:
Here are the relevant PRs from OCP 3.11. You can see that these endpoints are not part of the standard SDK (they use an entirely separate SDK). To support these regions, the endpoints had to be configured explicitly.
Seth Jennings has put together a highly customized POC.
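For context, a hedged sketch of what explicit endpoint configuration looks like with aws-sdk-go, in the spirit of the 3.11 approach described above; the C2S endpoint URL below is a placeholder, not a real address.

```go
// Sketch: supply endpoints for an isolated region explicitly, since they are
// not in the SDK's standard endpoint tables.
package c2s

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

func newC2SSession() (*session.Session, error) {
	resolver := endpoints.ResolverFunc(func(service, region string, opts ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
		if service == "ec2" && region == "us-iso-east-1" {
			// Placeholder URL; a real deployment would use the actual
			// C2S endpoint for the service.
			return endpoints.ResolvedEndpoint{
				URL:           "https://ec2.us-iso-east-1.c2s.example",
				SigningRegion: region,
			}, nil
		}
		return endpoints.DefaultResolver().EndpointFor(service, region, opts...)
	})
	return session.NewSession(&aws.Config{
		Region:           aws.String("us-iso-east-1"),
		EndpointResolver: resolver,
	})
}
```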
Dependencies:
Prioritized epics + deliverables (in scope / not in scope):
Related : https://jira.coreos.com/browse/CORS-1271
Estimate (XS, S, M, L, XL, XXL): L
Customers: North America Public Sector and Government Agencies
Open Questions:
We need to continue to maintain specific areas within storage; this epic captures that effort and tracks it across releases.
Goals
Requirements
Requirement | Notes | isMvp? |
---|---|---|
Telemetry | No | |
Certification | No | |
API metrics | No | |
Out of Scope
n/a
Background, and strategic fit
With the expected scale of our customer base, we want to keep the load of customer tickets/BZs low.
Assumptions
Customer Considerations
Documentation Considerations
Notes
In progress:
High prio:
Unsorted
Traditionally we did these updates as bugfixes, because we did them after feature freeze (FF). We are trying no-feature-freeze in 4.12: we will try to do as much as we can before FF, but we're quite sure something will slip past FF as usual.
Update all CSI sidecars to the latest upstream release.
This includes an update of the VolumeSnapshot CRDs in https://github.com/openshift/cluster-csi-snapshot-controller-operator/tree/master/assets
Update all OCP and kubernetes libraries in storage operators to the appropriate version for OCP release.
This includes (but is not limited to):
Operators:
This section includes Jira cards that are linked to an Epic, but the Epic itself is not linked to any Feature. These epics were completed when this image was assembled
Monitoring needs to be reliable and is especially useful when trying to debug clusters in an already degraded state. We want to ensure that metrics scraping can always work if the scraper can reach the target, even if the kube-apiserver is unavailable or unreachable. To do this, we will combine a local authorizer (already merged in many binaries and the rbac-proxy) and client-cert based authentication to have a fully local authentication and authorization path for scraper targets.
If networking (or part of networking) is down and a scraper target cannot reach the kube-apiserver to verify a token and perform a SubjectAccessReview, then the metrics scrape can be rejected. The SubjectAccessReview (authorization) side is already largely addressed, but service account tokens are still used for scraping targets. Tokens require an external network call that we can avoid by using client certificates. Gathering metrics, especially client metrics, from partially functional clusters helps narrow the search area between kube-apiserver, etcd, kubelet, and SDN considerably.
In addition, this will significantly reduce the load on the kube-apiserver. We have observed in the CI cluster that token and subject access reviews are a significant percentage of all kube-apiserver traffic.
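A minimal sketch of the fully local path using standard Go TLS primitives: the scrape target verifies the scraper's client certificate against a locally mounted CA bundle, so no TokenReview or SubjectAccessReview call to the kube-apiserver is needed. File paths are illustrative.

```go
// Sketch: a metrics endpoint that authenticates scrapers with client certs,
// entirely locally -- no network round-trip to the kube-apiserver.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// CA that signed the scraper's (e.g. Prometheus') client certificate.
	caPEM, err := os.ReadFile("/etc/tls/client-ca/ca.crt")
	if err != nil {
		log.Fatal(err)
	}
	pool := x509.NewCertPool()
	pool.AppendCertsFromPEM(caPEM)

	server := &http.Server{
		Addr: ":8443",
		TLSConfig: &tls.Config{
			ClientCAs: pool,
			// Reject any scraper that cannot present a cert chaining to
			// the CA; authorization (e.g. on the cert's CN) stays local too.
			ClientAuth: tls.RequireAndVerifyClientCert,
		},
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("# metrics go here\n"))
		}),
	}
	log.Fatal(server.ListenAndServeTLS("/etc/tls/serving/tls.crt", "/etc/tls/serving/tls.key"))
}
```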
User story:
As the cluster-policy-controller, I automatically approve cert signing requests issued by monitoring.
DoD:
Implementation hints: leverage approving logic implemented in https://github.com/openshift/library-go/pull/1083.
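A hedged sketch of the approval step with client-go, in the spirit of the library-go logic linked above; the signer-name filter and reason strings are illustrative, and real logic would also verify the requestor and the CSR contents before approving.

```go
// Sketch: approve a monitoring client-cert CSR by appending an Approved
// condition and posting it via the approval subresource.
package csrapprover

import (
	"context"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func approveMonitoringCSR(ctx context.Context, client kubernetes.Interface, csr *certificatesv1.CertificateSigningRequest) error {
	// Illustrative filter: only consider kube-apiserver client-cert requests.
	if csr.Spec.SignerName != certificatesv1.KubeAPIServerClientSignerName {
		return nil
	}
	csr.Status.Conditions = append(csr.Status.Conditions, certificatesv1.CertificateSigningRequestCondition{
		Type:    certificatesv1.CertificateApproved,
		Status:  "True",
		Reason:  "AutoApproved",
		Message: "monitoring client certificate approved by cluster-policy-controller",
	})
	_, err := client.CertificatesV1().CertificateSigningRequests().UpdateApproval(ctx, csr.Name, csr, metav1.UpdateOptions{})
	return err
}
```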
As a cluster administrator of a disconnected OCP cluster,
I want a list of possible sample images to mirror
So that I can configure my image mirror prior to installing OCP in a disconnected environment.
It is too onerous to find a connected cluster in order to obtain the list of possible sample images to mirror using the current documented procedures.
I need a list made available to me in my disconnected cluster that I can reference after initial install.
The Samples Operator's use of the reason field in its config object to track imagestream import completion has resulted in that singleton being a bottleneck and a source of update conflicts (we are talking 60 or 70 imagestreams potentially updating that one field concurrently).
See this hackday PR for an alternative approach that uses a ConfigMap per imagestream.
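A hedged sketch of that alternative using client-go; the namespace, ConfigMap naming scheme, and data key are illustrative, not necessarily what the hackday PR actually used.

```go
// Sketch: record import completion in a per-imagestream ConfigMap, so 60-70
// concurrent imports no longer contend on a single config singleton.
package samples

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func recordImportComplete(ctx context.Context, client kubernetes.Interface, imagestream string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "imagestream-import-" + imagestream,
			Namespace: "openshift-cluster-samples-operator",
		},
		Data: map[string]string{"status": "imported"},
	}
	// One writer per imagestream: no cross-imagestream update conflicts.
	_, err := client.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
	return err
}
```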
Goal:
Record the lastImageChangeTriggeredID field in the BuildConfig status, instead of spec.
Problem:
BuildConfigs use the lastImageChangeTriggeredID field to record the SHA of the last image used to trigger a build. This information should only be managed by the build controller and belongs in the BuildConfig status. By keeping the data in spec, cluster admins and developers can easily alter or remove this information.
Why is this important?
Moving this information to status is a prerequisite for using GitOps to manage BuildConfigs with ImageChange triggers.
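A hypothetical sketch of the API shape after the move; these field names are illustrative, not the final openshift/api types.

```go
// Sketch: the trigger ID moves under status, which only the build controller
// writes, so GitOps tooling managing spec never fights over it.
package buildv1sketch

// BuildConfigStatus gains per-trigger state managed by the controller.
type BuildConfigStatus struct {
	// ImageChangeTriggers mirrors the spec triggers, but the recorded
	// lastTriggeredImageID can no longer be edited via spec.
	ImageChangeTriggers []ImageChangeTriggerStatus `json:"imageChangeTriggers,omitempty"`
}

type ImageChangeTriggerStatus struct {
	From                 string `json:"from,omitempty"`
	LastTriggeredImageID string `json:"lastTriggeredImageID,omitempty"`
}
```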
Dependencies
None
Stories and Deliverables
Estimate: (XS, S, M, L, XL, XXL) S
Previous Work:
https://bugzilla.redhat.com/show_bug.cgi?id=1876500
User Stories:
As an OpenShift engineer
I want image change triggers to record their information in the BuildConfig status
So that, eventually, changes to ImageChange triggers in the BuildConfig spec do not cause builds to launch
As a developer using OpenShift to build images
I want to trigger a build when the base image of my application changes
So that bug fixes and security patches are applied to my application.
As an OpenShift cluster admin
I want to be alerted if something cleared the LastImageTriggeredID in a BuildConfig
So that I am aware that cluster users are relying on deprecated behavior.
Success Criteria:
1. Image change triggers record lastImageTriggeredID in the BuildConfig's status
2. Customers are notified that lastImageTriggeredID is deprecated in BuildConfig's spec
3. Customers are alerted if their users are relying on deprecated behavior
Open Questions:
See https://docs.google.com/document/d/1thcVR31TElJBXBGmyaYXZ9jRejp5W4-ppNYn4g3_mgk/edit# for a fuller explanation on how to use this template.
As an OpenShift engineer
I want image change triggers to record their information in the BuildConfig status
So that, eventually, changes to ImageChange triggers in the BuildConfig spec do not cause builds to launch
As a developer using OpenShift to build images
I want to trigger a build when the base image of my application changes
So that bug fixes and security patches are applied to my application.
The release note should inform customers that the lastImageChangeTriggeredID field in the BuildConfig spec is deprecated and will be ignored in OCP 4.9. Users relying on this information should update their scripts and jobs to read the triggered image ID from the BuildConfig status.
Please see https://issues.redhat.com/browse/RHDEVDOCS-2738 for more details.
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
If lastImageTriggeredID is set to the empty string or nil, today this will trigger a build. By moving the data to status, we can ignore the lastImageTriggeredID field in spec in a future release. To give users proper notice, we are not going to remove the current behavior until a later release.
A deprecation notice will need to be added to the release notes upon completion.
Changes will be needed in the following:
1. API
2. openshift-apiserver
3. openshift-controller-manager (build controller)
See https://bugzilla.redhat.com/show_bug.cgi?id=1876500.
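A minimal sketch of the build controller comparison once the recorded ID lives in status (names are illustrative): the spec copy of the ID is ignored, so clearing or editing it no longer fires a spurious build.

```go
// Sketch: decide whether an image change should launch a build, reading the
// recorded ID from status rather than spec.
package buildtrigger

func shouldTriggerBuild(lastTriggeredImageID, resolvedImageID string) bool {
	// An empty recorded ID means we have never triggered for this stream;
	// otherwise trigger only when the resolved image actually changed.
	return resolvedImageID != "" && resolvedImageID != lastTriggeredImageID
}
```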
Goal: Support OCI images.
Problem: Buildah and Podman use the OCI format by default, and the OpenShift image registry and ImageStream API don't understand it.
Why is this important: OCI images are expected to replace Docker schema 2 images; OpenShift should be ready when OCI images become widely adopted.
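For illustration, the two manifest media types involved; a registry that accepts both is the essence of this goal (the helper below is ours, not registry code).

```go
// Sketch: the registry must accept the OCI image manifest media type
// alongside Docker schema 2.
package manifests

const (
	dockerSchema2 = "application/vnd.docker.distribution.manifest.v2+json"
	ociManifest   = "application/vnd.oci.image.manifest.v1+json"
)

func supported(mediaType string) bool {
	switch mediaType {
	case dockerSchema2, ociManifest:
		return true
	default:
		return false
	}
}
```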
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL): XL
Previous Work:
Customers:
Open Questions:
As a user of OpenShift
I want to push OCI images to the registry
So that I can use buildah and podman with their defaults to push images
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Add pertinent notes here:
https://github.com/openshift/origin/pull/25475/files marked our tests for ISI as Disruptive.
Tests should wait until operators become stable, otherwise other tests will be run on an unstable cluster and it'll cause flakes.
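A hedged sketch of such a stability gate over the ClusterOperator API, assuming the standard Available/Progressing/Degraded conditions; a test harness would poll this per operator before letting other tests run.

```go
// Sketch: an operator is "stable" when Available=True, Progressing=False,
// and Degraded=False are all reported.
package e2eutil

import (
	configv1 "github.com/openshift/api/config/v1"
)

func operatorStable(co *configv1.ClusterOperator) bool {
	want := map[configv1.ClusterStatusConditionType]configv1.ConditionStatus{
		configv1.OperatorAvailable:   configv1.ConditionTrue,
		configv1.OperatorProgressing: configv1.ConditionFalse,
		configv1.OperatorDegraded:    configv1.ConditionFalse,
	}
	seen := 0
	for _, c := range co.Status.Conditions {
		if expected, ok := want[c.Type]; ok {
			if c.Status != expected {
				return false
			}
			seen++
		}
	}
	// All three conditions must be present and in the expected state.
	return seen == len(want)
}
```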
The integration tests for the image registry expect that OpenShift and the tests run on the same machine (i.e. OpenShift can connect to sockets that the tests listen on). This is not the case with e2e tests.
Goal: Rebase registry to Docker Distribution
Problem: The registry is currently based on an outdated version of the upstream docker/distribution project. The base does not even have a version associated with it - DevEx last rebased on an untagged commit.
Why is this important: Update the registry with improvements and bug fixes from the upstream community.
Dependencies (internal and external):
Prioritized epics + deliverables (in scope / not in scope):
Estimate (XS, S, M, L, XL, XXL): M
Previous Work:
Customers:
Open questions:
As a user of OpenShift
I want the image registry to be rebased on the latest docker/distribution release (v2.7.1)
So that the image registry has the latest upstream bugfixes and enhancements
Dependencies identified
Blockers noted and expected delivery timelines set
Design is implementable
Acceptance criteria agreed upon
Story estimated
Add pertinent notes here:
This section includes Jira cards that are not linked to either an Epic or a Feature. These tickets were completed when this image was assembled
Description of problem:
oc --context build02 get clusterversion
NAME      VERSION       AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.12.0-ec.1   True        False         45h     Error while reconciling 4.12.0-ec.1: the cluster operator kube-controller-manager is degraded

oc --context build02 get co kube-controller-manager
NAME                      VERSION       AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
kube-controller-manager   4.12.0-ec.1   True        False         True       2y87d   GarbageCollectorDegraded: error fetching rules: Get "https://thanos-querier.openshift-monitoring.svc:9091/api/v1/rules": dial tcp 172.30.153.28:9091: connect: cannot assign requested address
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
Actual results:
Expected results:
Additional info:
build02 is a build farm cluster in CI production.
I can provide credentials to access the cluster if needed.
discover-etcd-initial-cluster was written very early in the cluster-etcd-operator life cycle. We have observed at least one bug in this code, and in order to validate logical correctness it needs to be rewritten with unit tests.
When the CSISnapshot capability is disabled, all CSI driver operators are Degraded. For example, the AWS EBS CSI driver operator during installation:
18:12:16.895: Some cluster operators are not ready: storage (Degraded=True AWSEBSCSIDriverOperatorCR_AWSEBSDriverStaticResourcesController_SyncError: AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: "volumesnapshotclass.yaml" (string): the server could not find the requested resource AWSEBSCSIDriverOperatorCRDegraded: AWSEBSDriverStaticResourcesControllerDegraded: )
Ginkgo exit error 1: exit with code 1
Version-Release number of selected component (if applicable):
4.12.nightly
The reason is that cluster-csi-snapshot-controller-operator does not create the VolumeSnapshotClass CRD, which the AWS EBS CSI driver operator expects to exist.
CSI driver operators must skip VolumeSnapshotClass creation if the CRD does not exist.
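A hedged sketch of that skip logic using the apiextensions client; the helper name is ours, and the operator would consult it before syncing the volumesnapshotclass.yaml asset instead of reporting Degraded.

```go
// Sketch: probe for the VolumeSnapshotClass CRD and treat NotFound as
// "capability disabled, skip the asset", not as an error.
package csidriver

import (
	"context"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func volumeSnapshotClassCRDExists(ctx context.Context, client apiextensionsclient.Interface) (bool, error) {
	_, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(
		ctx, "volumesnapshotclasses.snapshot.storage.k8s.io", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // CSISnapshot capability disabled: skip the asset
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```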
Description of problem:
In looking at jobs on an accepted payload at https://amd64.ocp.releases.ci.openshift.org/releasestream/4.12.0-0.ci/release/4.12.0-0.ci-2022-08-30-122201, I observed this job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.12-e2e-aws-sdn-serial/1564589538850902016 with "Undiagnosed panic detected in pod":
pods/openshift-controller-manager-operator_openshift-controller-manager-operator-74bf985788-8v9qb_openshift-controller-manager-operator.log.gz:E0830 12:41:48.029165 1 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Version-Release number of selected component (if applicable):
4.12
How reproducible:
Probably relatively easy to reproduce (but not consistently), given it has happened several times according to this search: https://search.ci.openshift.org/?search=Observed+a+panic%3A+%22invalid+memory+address+or+nil+pointer+dereference%22&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Steps to Reproduce:
1. Let nightly payloads run, or run one of the presubmit jobs mentioned in the search above.
Actual results:
Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
Expected results:
no panics
Additional info: