ISD On-Prem Production Infrastructure Requirements

Identify Kubernetes Environment:

  • Access: Admin access to ONE namespace

  • Compute: In general, nodes with more memory are preferred because ISD (with Spinnaker) is memory-intensive

    • Minimum: 8 CPUs, 32 GB RAM, 3 nodes (6 nodes for the maximum configuration)

    • Preferred: nodes with 64–128 GB RAM

  • Network: Is outbound internet access allowed, including HTTP and gRPC traffic and HTTP traffic to all cloud endpoints and artifact repositories?

    • If Yes: Proceed with the normal installation.

    • If No: Choose the air-gapped installation.

    • If Yes, but proxy access is required for HTTP and gRPC is not allowed: Treat this the same as No, and configure proxies as described below.

  • ISTIO/Service Mesh: If a service mesh such as Istio is in use, additional configuration is required for external access, including access to databases, cloud endpoints, and artifact and data endpoints, to ensure seamless integration (see the sketch below).
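For illustration, if the mesh restricts outbound traffic (e.g. Istio's REGISTRY_ONLY outbound policy), each external endpoint typically needs a ServiceEntry. The sketch below covers a single external database; the hostname and port are placeholders for your environment.

```yaml
# A minimal sketch, assuming Istio with a REGISTRY_ONLY outbound policy;
# the host below is a placeholder for a real external database endpoint.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-aurora-mysql
spec:
  hosts:
    - isd-db.cluster-example.us-west-2.rds.amazonaws.com   # placeholder endpoint
  ports:
    - number: 3306
      name: tcp-mysql
      protocol: TCP
  location: MESH_EXTERNAL
  resolution: DNS
```

Similar entries would be needed for cloud endpoints, artifact repositories, and data endpoints that sit outside the mesh.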

ISD needs the following databases:

  • Aurora PostgreSQL (e.g. RDS): The recommended cluster size is db.r6g.xlarge (4 CPUs, 32 GB); versions up to 13.3 have been tested. This database is used by Autopilot (aka OES). As a starting point, allocate an estimated 20 GB of storage.

  • S3 Bucket(s): Required for Kayenta (or Verification).

  • Aurora MySQL 5.7 (2.07.2 or later) RDS: The recommended cluster size is db.r5.xlarge (4 CPUs, 32 GB). This is used by Clouddriver, Orca, and Front50, which can share one instance or use three separate instances for better performance (see the sketch after this list).

    • Spinnaker service-level sizing requirements

      • For Front50, 100 MB is sufficient.

      • For Orca, it depends on the number of executions and how many months of data are retained. For example, assuming 200 pipelines with 5 executions/day, 360 days of retention, and roughly 1 MB per execution: 5 × 360 × 200 × 1 MB ≈ 360 GiB.

      • For Clouddriver, it depends on the number of namespaces and the number of resources. For example, assuming 200 namespaces with 20 deployments/namespace at 5 MB/deployment: 200 × 20 × 5 MB ≈ 20 GiB. A Best Practices for Aurora DBs document is available here.

  • ElastiCache Redis (5.0.6): The recommended cluster size is cache.r6g.large, serving Gate and the other services. Typically, one Redis instance is adequate for all services (Gate, Fiat, and possibly Orca).
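As a reference for the MySQL items above, a hedged sketch of pointing Clouddriver at Aurora MySQL through a clouddriver-local.yml override follows; the endpoint, database, and user names are placeholders, and the exact keys should be verified against the Spinnaker SQL documentation for the installed version.

```yaml
# A hedged clouddriver-local.yml sketch for the SQL backend; the JDBC
# endpoint and user names are placeholders for your environment.
sql:
  enabled: true
  taskRepository:
    enabled: true
  cache:
    enabled: true
  connectionPools:
    default:
      default: true
      jdbcUrl: jdbc:mysql://isd-db.cluster-example.us-west-2.rds.amazonaws.com:3306/clouddriver
      user: clouddriver_service
  migration:
    jdbcUrl: jdbc:mysql://isd-db.cluster-example.us-west-2.rds.amazonaws.com:3306/clouddriver
    user: clouddriver_migrate
redis:
  enabled: false   # the SQL backend replaces Redis for caching/task state
```

Orca and Front50 have analogous SQL settings, each pointing at its own database (or schema) on the same or separate instances.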

Identify Proxy configuration:

Identify the proxy configuration for accessing any external resources. The JAVA_OPTS values for http.proxyHost and http.nonProxyHosts need to be defined, and all ISD services must be added to nonProxyHosts (a sketch follows the note below).

Note: Most proxy services automatically redirect HTTPS to HTTP and vice versa when proxying requests. If this is NOT the case, define https.proxyHost (and https.proxyPort) as well; http.nonProxyHosts covers both protocols.
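For illustration, the JAVA_OPTS for a proxied environment might look like the sketch below, assuming the chart exposes an environment-variable block; proxy.example.com:3128 and the nonProxyHosts entries are placeholders to adapt.

```yaml
# A minimal sketch; the proxy host/port and the nonProxyHosts entries are
# placeholders. All ISD services must appear in nonProxyHosts so that
# service-to-service calls bypass the proxy.
env:
  JAVA_OPTS: >-
    -Dhttp.proxyHost=proxy.example.com
    -Dhttp.proxyPort=3128
    -Dhttps.proxyHost=proxy.example.com
    -Dhttps.proxyPort=3128
    -Dhttp.nonProxyHosts=localhost|127.0.0.1|*.svc.cluster.local|oes-*|spin-*
```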

Custom CA certificates:

If there are any custom or self-signed CAs that need to be honored, they need to be included in oes-cacerts as mentioned here (a sketch of the secret's shape follows).
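As a rough sketch of the shape involved (an assumption about the secret's layout, not a definitive format), oes-cacerts is a Kubernetes secret carrying a Java truststore that includes the custom CAs:

```yaml
# A hedged sketch of the oes-cacerts secret; the data value is a placeholder
# for a base64-encoded Java truststore (typically built with keytool) that
# contains the custom/self-signed CAs.
apiVersion: v1
kind: Secret
metadata:
  name: oes-cacerts
type: Opaque
data:
  cacerts: <base64-encoded truststore>
```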

SSO:

Identify the SSO mechanism in use: SAML (e.g. Okta, JumpCloud), OAuth, or AD/LDAP.

  • Admin user: Create a service-account user that will act as an admin.

  • Admin group(s): Identify the groups that grant admin rights to users who belong to any one of them.

  • RBAC: Define the groups/roles that are needed for the organization.

Note:

  1. In the case of AD/LDAP, configuring the appropriate search strings might involve some trial and error, depending on the admin support available, knowledge of the group structure, how well structured the groups are, and the available documentation (an illustrative sketch follows this note).

  2. For SAML, we configure it as both the IdP (identity provider) and the SP (service provider). Using a separate IdP is not supported by Spinnaker.
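To make the AD/LDAP trial and error concrete, the sketch below shows illustrative Gate-side LDAP settings; the URL, search base, and filter are placeholders that typically need iteration against the real directory.

```yaml
# A hedged gate-local.yml LDAP sketch; every DN, filter, and URL here is a
# placeholder to adapt to the actual directory structure.
ldap:
  enabled: true
  url: ldaps://ldap.example.com:636/dc=example,dc=com
  userSearchBase: ou=users
  userSearchFilter: (&(objectClass=person)(sAMAccountName={0}))
```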

URLs, routing and TLS termination:

  • Identify URLs for the application: Three URLs are required at a minimum: isd.domain-name.com, isd-gate.domain-name.com, and spin.domain-name.com. Two additional URLs may be needed, depending on whether separate Spinnaker and agent-based deployments are used.

  • Decide how traffic from these URLs will be routed to the Kubernetes services: Ingress (nginx or other), Istio gateway, or LoadBalancer (see the sketch after this list).

  • Decide where TLS termination will happen: Ingress, Load Balancer, or Gate+UI.

  • Decide how the TLS certificates will be created: cert-manager, cloud-issued (e.g. AWS), or custom certificates.
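As one possible combination of the choices above, the following is a minimal sketch of an nginx Ingress terminating TLS for the three hostnames; the backend service names and ports are placeholders that depend on the ISD chart.

```yaml
# A hedged sketch: nginx ingress with TLS terminated at the ingress.
# Hostnames follow the naming above; service names/ports are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: isd
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - isd.domain-name.com
        - isd-gate.domain-name.com
        - spin.domain-name.com
      secretName: isd-tls   # issued by cert-manager, the cloud, or custom certs
  rules:
    - host: isd.domain-name.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: oes-ui        # placeholder service name
                port:
                  number: 80
    - host: isd-gate.domain-name.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: oes-gate      # placeholder service name
                port:
                  number: 8084
    - host: spin.domain-name.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: spin-deck     # placeholder service name
                port:
                  number: 9000
```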

Secrets handling:

  • ISD support for git-encrypted, Kubernetes, and Vault secrets is already built in.

  • Decide where we want to store secrets: git-encrypted, Kubernetes, Vault, or other (e.g. AWS Secrets Manager, Azure Key Vault, CyberArk, etc.).

  • Should any customization be required, it needs to be included in the Helm chart.

  • Decide where we want to store kubeconfig files and other secret “files”, and how to make them available to the application.

Note: A Clouddriver sidecar ensures that kubeconfig files are handled properly when they are stored in a different secret store. The sidecar for Kubernetes and Vault is built in (see the sketch below).
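As a sketch of the built-in Kubernetes option, kubeconfig files can be kept in a secret and made available to Clouddriver; the secret and file names below are placeholders.

```yaml
# A minimal sketch; secret and file names are placeholders. The sidecar (or a
# volume mount) surfaces these files to Clouddriver at a known path.
apiVersion: v1
kind: Secret
metadata:
  name: spin-kubeconfigs
type: Opaque
stringData:
  prod-cluster.config: |
    # kubeconfig contents for the target cluster go here
```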

Component Sizing:

Depending on the assessment of the number of cloud accounts, number of objects, etc., an initial sizing will be recommended. Note that in a microservices architecture, sizing each component involves some trial and error, as it depends on the usage pattern. It is best to start with larger values and reduce them, to avoid surprises. If cost saving is a key consideration, start with lower values and increase them based on experience and usage metrics.
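To illustrate "start large, then reduce", a per-service override in values.yaml might look like the sketch below; the key layout is an assumption to verify against the ISD chart, and the numbers are only a starting point to tune from usage metrics.

```yaml
# Illustrative requests/limits for one memory-hungry service; the values and
# the surrounding key structure are assumptions, not the chart's actual schema.
clouddriver:
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 8Gi
```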

Monitoring and alerting:

Prometheus-based alerting is available by enabling “central monitoring” in the values.yaml. If any other monitoring is to be used, it will need to be configured separately. Typically, the customer’s infrastructure team takes care of this, based on the Prometheus alerts that we provide.
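For teams wiring the provided alerts into their own stack, alerts can be expressed as Prometheus Operator PrometheusRule resources; the rule below is illustrative only and is not the shipped ISD alert set.

```yaml
# An illustrative PrometheusRule; the expression, job label, and duration are
# placeholders, shown only to indicate the shape of such alerts.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: isd-alerts
spec:
  groups:
    - name: isd.rules
      rules:
        - alert: GateDown
          expr: up{job="oes-gate"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Gate has been unreachable for 5 minutes"
```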

Backup and Disaster Recovery:

  • Data backup is typically done via the cloud provider. Should we decide to do it ourselves, pipelines need to be set up to back up the data at regular intervals.

  • Pipeline backup is done via a cron trigger for the syncToGit pipeline in the opsmx-gitops application (see the sketch below).
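For reference, the cron trigger on the syncToGit pipeline might look like the following; Spinnaker stores pipelines as JSON, shown here as YAML for readability, and the schedule is a placeholder.

```yaml
# A hedged sketch of the pipeline's trigger section; the Quartz cron
# expression below (daily at 02:00) is a placeholder schedule.
triggers:
  - type: cron
    cronExpression: "0 0 2 * * ? *"
    enabled: true
```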
