App Scaling: GitLab CI Automation in 2026

Scaling an application successfully in 2026 isn’t just about good code; it’s fundamentally about smart operations and automation. From initial deployment to handling millions of concurrent users, the right automated workflows can make or break your product. We’re talking about more than scripting tasks; we’re talking about building intelligent, self-healing systems. But how do you actually implement this when the available advice ranges from high-level strategy to deep dives into specific technologies? This guide cuts through the noise with a practical, step-by-step approach to building automation into every layer of your app’s growth.

Key Takeaways

  • Implement a Continuous Integration/Continuous Deployment (CI/CD) pipeline using GitLab CI or GitHub Actions within the first month of development to reduce deployment errors by over 70%.
  • Automate infrastructure provisioning with Infrastructure as Code (IaC) tools like Terraform or Pulumi, aiming for 90% infrastructure reproducibility.
  • Establish proactive monitoring and alerting systems with Datadog or Prometheus to detect and respond to anomalies within minutes, not hours.
  • Utilize serverless functions or container orchestration (Kubernetes) to dynamically scale resources based on real-time demand, reducing idle resource costs by 30-50%.
  • Regularly conduct automated security scans and penetration tests using tools like Snyk or OWASP ZAP to identify and remediate vulnerabilities before they become critical.

1. Establish a Robust CI/CD Pipeline from Day One

The first, and arguably most critical, step in scaling any application is to automate your software delivery lifecycle. I’ve seen too many promising startups get bogged down by manual deployments, leading to inconsistent environments and frequent errors. Our goal here is to push code from commit to production with minimal human intervention. For most teams, this means a solid Continuous Integration/Continuous Deployment (CI/CD) pipeline.

I strongly recommend using either GitLab CI or GitHub Actions. Both offer excellent integration with their respective version control systems and provide powerful, flexible YAML-based configurations.

Here’s a basic setup for a web application using GitHub Actions:

  1. Create your workflow file: In your repository, create .github/workflows/main.yml.
  2. Define triggers: We want this workflow to run on every push to the main branch and on every pull request that targets main.
    name: Deploy Web App

    on:
      push:
        branches:
          - main
      pull_request:
        branches:
          - main
  3. Set up jobs: A typical pipeline includes stages like build, test, and deploy.
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Node.js
            uses: actions/setup-node@v4
            with:
              node-version: '20'
          - name: Install dependencies
            run: npm ci
          - name: Run tests
            run: npm test

      deploy:
        needs: build
        # Deploy only on pushes to main, not on pull request runs
        if: github.event_name == 'push'
        runs-on: ubuntu-latest
        environment: production # Ensures environment protection and secrets are used
        steps:
          - uses: actions/checkout@v4
          - name: Deploy to AWS S3
            uses: jakejarvis/s3-sync-action@v0.5.1
            with:
              args: --acl public-read --follow-symlinks --delete
            env:
              AWS_S3_BUCKET: ${{ secrets.AWS_S3_BUCKET }}
              AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
              AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
              AWS_REGION: 'us-east-1'
  4. Configure secrets: Go to your GitHub repository settings > Secrets and variables > Actions. Add AWS_S3_BUCKET, AWS_ACCESS_KEY_ID, and AWS_SECRET_ACCESS_KEY.

This simple workflow installs your Node.js application’s dependencies, runs its tests, and then deploys it to an AWS S3 bucket. You can extend this for containerized applications, serverless functions, or more complex multi-service architectures.

Pro Tip: Implement branch protection rules on your main branch requiring successful CI builds and code reviews before merging. This prevents broken code from ever reaching your production pipeline.

Common Mistakes: Forgetting to include comprehensive automated tests (unit, integration, end-to-end) in your CI stage. A pipeline that only builds but doesn’t validate is a false sense of security. Another common pitfall is hardcoding credentials instead of using secure environment variables or secrets management.
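To make the hardcoded-credentials pitfall concrete, here is a minimal Python sketch of a check you could run as an early CI step. It is an illustration, not a replacement for a dedicated secret scanner: the "AKIA" prefix check reflects the usual shape of long-term AWS access key IDs, while the generic password/token pattern and the function name find_hardcoded_secrets are assumptions made for this example.

```python
import re

# Long-term AWS access key IDs typically look like "AKIA" + 16 uppercase letters/digits.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")
# Hypothetical heuristic: assignments such as password = "..." or api_key: '...'
GENERIC_SECRET = re.compile(
    r"(?i)\b(password|secret|api_key|token)\b\s*[:=]\s*['\"]([^'\"]{8,})['\"]"
)


def find_hardcoded_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, reason) pairs for lines that look like secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "${{ secrets." in line:
            continue  # GitHub Actions secret references are the safe form
        if AWS_KEY_ID.search(line):
            findings.append((number, "possible AWS access key ID"))
        elif GENERIC_SECRET.search(line):
            findings.append((number, "possible hardcoded credential"))
    return findings
```

A CI step could read each changed file, call find_hardcoded_secrets on its contents, and exit non-zero when the list is not empty.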

2. Automate Infrastructure with Infrastructure as Code (IaC)

Once your application code is automated, the next logical step is to automate the underlying infrastructure. This concept, known as Infrastructure as Code (IaC), treats your servers, databases, networks, and load balancers just like application code – version-controlled, testable, and deployable through automation. This is non-negotiable for scaling. Manual infrastructure provisioning leads to “configuration drift,” where environments diverge, making debugging a nightmare.

My firm exclusively uses Terraform for IaC. It’s cloud-agnostic and incredibly powerful. Pulumi is a viable alternative if you prefer defining infrastructure in a general-purpose programming language, and teams deeply embedded in a single cloud provider can use AWS CloudFormation or Azure Resource Manager, but Terraform gives you portability.

Here’s a simplified Terraform example to provision an AWS EC2 instance:

  1. Install Terraform: Follow the instructions on the Terraform website.
  2. Create a main.tf file:
    provider "aws" {
      region = "us-east-1"
    }
    
    resource "aws_instance" "web_server" {
      ami           = "ami-0abcdef1234567890" # Replace with a valid AMI ID for your region
      instance_type = "t2.micro"
      key_name      = "my-ssh-key" # Ensure this key pair exists in your AWS account
      tags = {
        Name        = "WebServer"
        Environment = "Production"
      }
    }
    
  3. Initialize Terraform: Open your terminal in the directory containing main.tf and run terraform init.
  4. Plan changes: Run terraform plan to see what Terraform will do without actually making changes.
  5. Apply changes: Run terraform apply and type yes when prompted. This will provision your EC2 instance.

This is a foundational piece. We then integrate this into our CI/CD pipeline, so infrastructure changes are reviewed and applied automatically, just like code changes. This ensures that every environment – development, staging, and production – is identical, reducing “it worked on my machine” issues.
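One way to review infrastructure changes automatically is to inspect the machine-readable plan before applying it. The Python sketch below assumes you have exported the plan with terraform show -json and loaded it as a dictionary; it reads the resource_changes entries and their change.actions lists. The resource address aws_instance.legacy and the function names are hypothetical, and the "block deletes" policy is just one possible rule.

```python
import json
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count planned actions (create, update, delete, ...) across resources."""
    counts = Counter()
    for change in plan.get("resource_changes", []):
        for action in change["change"]["actions"]:
            counts[action] += 1
    return counts


def plan_is_safe(plan: dict, allow_delete: bool = False) -> bool:
    """Reject plans that delete resources unless deletion is explicitly allowed."""
    return allow_delete or summarize_plan(plan)["delete"] == 0


# Trimmed, hypothetical plan: one new instance and one replacement
# (a replacement appears as a delete followed by a create).
example_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_instance.web_server", "change": {"actions": ["create"]}},
    {"address": "aws_instance.legacy", "change": {"actions": ["delete", "create"]}}
  ]
}
""")
```

In a pipeline, a False result from plan_is_safe could pause the run for manual approval instead of applying automatically.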

Pro Tip: Use Terraform modules for reusable, encapsulated infrastructure components. This promotes consistency and reduces boilerplate code across projects and teams. For instance, create a module for a standard VPC, another for a database cluster, and so on.

Common Mistakes: Not managing Terraform state properly. The terraform.tfstate file tracks your infrastructure. Store it in a remote backend like AWS S3 with versioning and locking to prevent corruption and ensure team collaboration. Never commit .tfstate to Git!

3. Implement Automated Monitoring and Alerting

Once your app is deployed and running on automated infrastructure, you need to know if it’s actually working well. Automated monitoring and alerting are critical for identifying issues before they impact users and for understanding performance bottlenecks during scaling events. Without this, you’re flying blind.

My go-to solution is Datadog for comprehensive observability, though Prometheus combined with Grafana is an excellent open-source alternative. Datadog provides unified logging, metrics, and tracing, which is invaluable for quickly pinpointing root causes.

Here’s how you’d set up a basic CPU utilization alert in Datadog:

  1. Install the Datadog Agent: On your EC2 instance (or Kubernetes cluster), install the Datadog Agent. Follow the instructions provided in your Datadog account under “Integrations” > “Agent.”
  2. Verify data ingestion: After installation, check the Datadog UI under “Metrics Explorer” to ensure you’re seeing host metrics like system.cpu.idle.
  3. Create a new monitor:
    • Go to “Monitors” > “New Monitor.”
    • Select “Metric” as the monitor type.
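    • For “Choose the metric,” search for and select system.cpu.idle. The standard Agent host metrics report CPU time by state, so utilization is 100 minus the idle percentage.
    • Define the query: avg(last_5m):avg:system.cpu.idle{environment:production} by {host} < 20
    • Set alert conditions: “Alert when the metric is below” 20% idle (that is, above 80% utilization) on average over the last 5 minutes.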
    • Configure notification: Specify who should be notified (e.g., your Slack channel @devops-alerts or an email group).

This ensures that if any production host’s CPU utilization consistently exceeds 80% for five minutes, your team gets an immediate alert. This proactive approach allows you to investigate and potentially scale up resources before users experience slowdowns.
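The idea behind a windowed threshold alert can be sketched in a few lines of Python. This is a simplified model for illustration only, not how Datadog evaluates monitors internally: it assumes one utilization sample per minute and fires only once a full five-sample window averages above the threshold. The class name ThresholdMonitor is hypothetical.

```python
from collections import deque


class ThresholdMonitor:
    """Fires when the rolling average of recent samples exceeds a threshold."""

    def __init__(self, threshold: float, window_size: int):
        self.threshold = threshold
        self.samples = deque(maxlen=window_size)  # oldest samples drop off

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the alert condition holds."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average > self.threshold
```

Averaging over a window is what keeps a single brief spike from paging anyone, which helps with the alert fatigue problem.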

Pro Tip: Beyond basic resource metrics, implement application performance monitoring (APM) to track response times, error rates, and transaction throughput for specific services. Datadog APM or New Relic are excellent for this. This tells you if your application is slow, not just if your server is busy.

Common Mistakes: Alerting on too many non-critical metrics (alert fatigue) or not having clear escalation paths. Every alert should be actionable. If an alert fires and no one knows what to do about it, it’s a bad alert.

4. Leverage Dynamic Scaling with Container Orchestration or Serverless

The hallmark of a truly scalable application is its ability to automatically adjust resources based on demand. This is where technologies like container orchestration (e.g., Kubernetes) or serverless computing (e.g., AWS Lambda, Azure Functions) shine. Manual scaling is a relic of the past; it’s inefficient and prone to human error.

For applications with fluctuating, unpredictable traffic, serverless is often my first recommendation. I had a client last year, a gaming company, whose app would experience massive spikes during new game releases. We migrated their backend APIs to AWS Lambda, and their infrastructure costs dropped by 40% because they only paid for compute when requests were actively being processed, rather than for always-on servers.

If you’re building a more complex, microservices-based architecture, Kubernetes is the industry standard. It provides powerful primitives for auto-scaling, self-healing, and service discovery.

Here’s a conceptual approach to auto-scaling in Kubernetes:

  1. Containerize your application: Create a Docker image for your application.
  2. Define a Deployment: This tells Kubernetes how to run your application (e.g., number of replicas, image to use).
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-webapp
    spec:
      replicas: 3 # Start with 3 instances
      selector:
        matchLabels:
          app: my-webapp
      template:
        metadata:
          labels:
            app: my-webapp
        spec:
          containers:
            - name: webapp
              image: myrepo/my-webapp:v1.0.0
              ports:
                - containerPort: 8080
              resources:
                requests:
                  cpu: "200m"
                  memory: "256Mi"
                limits:
                  cpu: "500m"
                  memory: "512Mi"
  3. Create a Horizontal Pod Autoscaler (HPA): This automatically scales the number of pods (instances of your application) up or down based on CPU utilization or custom metrics.
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: my-webapp-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-webapp
      minReplicas: 3
      maxReplicas: 10 # Scale up to 10 instances
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70 # Target 70% CPU utilization
  4. Apply these configurations: Use kubectl apply -f deployment.yaml and kubectl apply -f hpa.yaml.

With this HPA in place, if average CPU utilization across your my-webapp pods rises above 70% of their requested CPU, Kubernetes will automatically add more pods, up to a maximum of 10, to handle the load. Conversely, if demand drops, it will scale back down, but never below 3 pods.
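The scaling decision itself follows a simple ratio documented for the HPA: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1.0 (0.1 by default). Here is a Python sketch of that arithmetic, assuming utilization is already averaged across pods. It deliberately ignores details of the real controller such as stabilization windows and pods that are not yet ready.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Apply the HPA ratio formula, then clamp to the configured bounds."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With the manifest above (target 70, bounds 3 to 10), three pods running at 140% of requested CPU would be scaled to six.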

Pro Tip: For critical applications, combine HPA with a Cluster Autoscaler (for cloud providers) to automatically add or remove nodes (underlying virtual machines) to your Kubernetes cluster. This ensures your cluster itself can grow to accommodate more pods when needed, providing true end-to-end elasticity.

Common Mistakes: Not setting appropriate resource requests and limits in Kubernetes deployments. The HPA computes utilization as a percentage of each pod’s CPU request, so without requests it has no baseline to scale on, and without limits a single pod can starve its neighbors and destabilize the cluster. Also, setting maxReplicas too high can lead to unexpectedly high cloud bills.

  • 85% faster deployment cycles: Teams leveraging GitLab CI for app scaling report significantly reduced deployment times.
  • 40% reduced infrastructure costs: Automated resource provisioning leads to optimized cloud spending for scalable applications.
  • 92% improved uptime and reliability: Automated testing and self-healing systems minimize downtime during peak loads.
  • 65% increased developer productivity: Developers focus on innovation, not manual scaling tasks, due to robust automation.

5. Automate Security Scanning and Compliance Checks

Security automation is often overlooked until a breach occurs. Trust me, you don’t want to be in that position. Integrating security checks into your CI/CD pipeline and IaC workflows is paramount. This means automating everything from vulnerability scanning in your code and dependencies to configuration compliance of your infrastructure.

We typically use Snyk for dependency and container image scanning, and Open Policy Agent (OPA) for IaC policy enforcement.

Here’s how to integrate Snyk into a GitHub Actions workflow to scan for vulnerabilities:

  1. Get a Snyk token: Sign up for Snyk and retrieve your API token.
  2. Add Snyk token to GitHub Secrets: Add it as SNYK_TOKEN in your repository secrets.
  3. Modify your CI/CD workflow (e.g., .github/workflows/main.yml): Add a new step in your build job.
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Set up Node.js
            uses: actions/setup-node@v4
            with:
              node-version: '20'
          - name: Install dependencies
            run: npm ci
          - name: Run Snyk to check for vulnerabilities
            uses: snyk/actions/node@master
            env:
              SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
            with:
              command: test
          - name: Run tests
            run: npm test

This step will now run a Snyk scan on your Node.js project’s dependencies every time the CI pipeline runs. By default, the step fails the build when Snyk reports vulnerabilities, and you can pass the Snyk CLI’s --severity-threshold option (for example, --severity-threshold=high) to fail only on more serious findings, preventing insecure code from reaching production. We enforce this rigidly; a failed security scan is just as bad as a failed unit test.
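If you want your own gating logic on top of a scanner’s output, the rule "fail the build at or above a given severity" is easy to express. The Python sketch below assumes a simplified report shape (a vulnerabilities list whose entries carry a severity field); real scanner output has many more fields, and the IDs and function names here are hypothetical.

```python
# Severity levels from least to most serious.
SEVERITY_ORDER = ["low", "medium", "high", "critical"]


def blocking_findings(report: dict, fail_on: str = "high") -> list[dict]:
    """Return findings at or above the `fail_on` severity."""
    minimum = SEVERITY_ORDER.index(fail_on)
    return [
        finding
        for finding in report.get("vulnerabilities", [])
        if SEVERITY_ORDER.index(finding["severity"]) >= minimum
    ]


def exit_code(report: dict, fail_on: str = "high") -> int:
    """Non-zero exit code fails the CI step, as a failed unit test would."""
    return 1 if blocking_findings(report, fail_on) else 0
```

A wrapper script could load the scanner’s JSON report, call exit_code, and pass the result to sys.exit.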

Pro Tip: Beyond static analysis, consider integrating dynamic application security testing (DAST) tools like OWASP ZAP into your staging environment. This allows you to scan your running application for vulnerabilities that might not be caught by static code analysis.

Common Mistakes: Running scans but ignoring the results. A security tool is only as good as your team’s commitment to remediating the identified issues. Also, not scanning container images regularly; outdated base images can introduce significant vulnerabilities.

6. Automate Database Backups and Recovery

Your application is only as resilient as its data. Automating database backups and having a clear, tested recovery plan is absolutely fundamental. Losing data is a catastrophic event that no amount of fancy scaling can fix. I’ve personally been involved in recovering from a database incident where manual backups were inconsistent – it was a nightmare that cost the company significant revenue and customer trust.

For managed database services like AWS RDS, automation is built-in. For self-managed databases, you’ll need to script it.

Here’s how to configure automated backups for an AWS RDS PostgreSQL instance:

  1. Navigate to RDS dashboard: In the AWS Management Console, go to RDS.
  2. Select your DB instance: Click on the instance you want to configure.
  3. Modify the instance: Click “Modify.”
  4. Configure backup settings:
    • Backup retention period: Set this to a minimum of 7 days, ideally 30 days for production.
    • Backup window: Choose a time when your database traffic is lowest (e.g., 03:00-04:00 UTC).
  5. Enable Copy tags to snapshots: This is a small but important detail for organization and cost tracking.
  6. Apply changes: Select “Apply immediately” for non-disruptive changes, or “Apply during the next scheduled maintenance window” if it’s a major change that might require a brief outage.

AWS RDS automatically takes daily snapshots within your specified window. It also retains transaction logs, which enables point-in-time recovery: you can restore your database to any second within your retention period, up to the latest restorable time (typically within the last five minutes).

Pro Tip: Don’t just automate backups; automate and regularly test your recovery process. Restore a backup to a separate environment periodically to ensure the backups are valid and your team knows how to perform a restore under pressure. This is a critical drill.

Common Mistakes: Assuming backups are working without verification. Also, storing backups in the same region as the primary database can be risky; consider cross-region replication for disaster recovery.
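Verification can start with something as small as checking that the newest backup is recent enough. This Python sketch assumes you have already collected snapshot timestamps (for example, from your cloud provider’s API) as timezone-aware datetimes. The 26-hour default is an assumption that allows a daily schedule some slack, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def newest_backup_age(snapshot_times: list, now: datetime) -> Optional[timedelta]:
    """Age of the most recent snapshot, or None if there are no snapshots."""
    if not snapshot_times:
        return None
    return now - max(snapshot_times)


def backups_are_fresh(
    snapshot_times: list,
    now: datetime,
    max_age: timedelta = timedelta(hours=26),
) -> bool:
    """True only if at least one snapshot exists and it is recent enough."""
    age = newest_backup_age(snapshot_times, now)
    return age is not None and age <= max_age
```

A freshness check like this only proves that backups exist. It complements a real restore drill and does not replace one.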

7. Automate Log Management and Analysis

Logs are the digital breadcrumbs of your application. When something goes wrong, or when you need to understand user behavior at scale, effective log management is indispensable. Trying to SSH into individual servers to grep log files is simply not viable for a scalable application. You need centralized, searchable, and automatable log collection and analysis.

Our standard stack involves Elastic Stack (ELK), or for a managed solution, Datadog (which we mentioned for monitoring) or AWS CloudWatch Logs.

Here’s a basic setup for centralizing logs from an EC2 instance to AWS CloudWatch Logs:

  1. Install CloudWatch Agent: On your EC2 instance, install the CloudWatch Agent.
  2. Configure the agent: Create a config.json file for the agent. This example collects Nginx access and error logs.
    {
        "agent": {
            "metrics_collection_interval": 60,
            "run_as_user": "root"
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": "/var/log/nginx/access.log",
                            "log_group_name": "/ecs/nginx-access",
                            "log_stream_name": "{instance_id}",
                            "timezone": "UTC"
                        },
                        {
                            "file_path": "/var/log/nginx/error.log",
                            "log_group_name": "/ecs/nginx-error",
                            "log_stream_name": "{instance_id}",
                            "timezone": "UTC"
                        }
                    ]
                }
            }
        }
    }
    
  3. Start the agent: Use sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/path/to/config.json to load the configuration and start the agent.
  4. Verify in CloudWatch: Go to the CloudWatch console > Log groups. You should see /ecs/nginx-access and /ecs/nginx-error log groups receiving data.

Now, all your Nginx logs are centralized, searchable, and can be used to create metrics and alerts directly within CloudWatch. For instance, you could set up an alert if the number of 5xx responses recorded in your Nginx access log group exceeds a certain threshold within 5 minutes.
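To show what such an alert actually counts, here is a small Python sketch that tallies 5xx responses in access-log lines. It assumes Nginx’s default "combined" log format, where the status code follows the quoted request. The threshold and function names are hypothetical, and in practice a CloudWatch metric filter would do this counting for you.

```python
import re

# Combined format: addr - user [time] "request" status bytes "referer" "agent"
COMBINED_LOG = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) ')


def count_server_errors(lines) -> int:
    """Count lines whose HTTP status code is in the 5xx range."""
    total = 0
    for line in lines:
        match = COMBINED_LOG.match(line)
        if match and match.group("status").startswith("5"):
            total += 1
    return total


def should_alert(lines, threshold: int) -> bool:
    """True when the number of 5xx responses exceeds the threshold."""
    return count_server_errors(lines) > threshold
```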

Pro Tip: Implement structured logging in your application. Instead of plain text, output logs in JSON format. This makes them significantly easier to parse, filter, and analyze in tools like CloudWatch Logs Insights or Kibana, speeding up debugging dramatically.
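As a minimal illustration of structured logging, the Python sketch below uses only the standard library’s logging and json modules to emit one JSON object per log line. The field names request_id, status, and duration_ms are hypothetical examples, and a production formatter would normally add a timestamp as well (omitted here to keep the output deterministic).

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        for key in ("request_id", "status", "duration_ms"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier setup
    logger.propagate = False
    return logger
```

Calling logger.info("request finished", extra={"request_id": "abc123", "status": 200}) then produces a line that log tools can filter by field rather than by regular expression.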

Common Mistakes: Not handling log rotation, leading to full disks. Also, logging sensitive data (PII) without proper redaction or encryption is a major compliance and security risk.

8. Automate Configuration Management

Configuration management ensures that all your servers, containers, and services are configured identically and consistently. While IaC provisions the infrastructure, configuration management tools configure the software running on that infrastructure. Think of it as the “operating system and application setup” layer. For scaling, manual configuration is a recipe for inconsistency and downtime.

I find Ansible to be the most approachable tool for this, especially for smaller to medium-sized teams, due to its agentless nature (it uses SSH). For larger enterprises, Chef or Puppet are also strong contenders.

Here’s a simple Ansible playbook to install Nginx on a server:

  1. Install Ansible: Follow the installation guide on the Ansible website.
  2. Create an inventory.ini file: This lists your target servers.
    [webservers]
    web1 ansible_host=ec2-1-2-3-4.compute-1.amazonaws.com ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/my-ssh-key.pem
    
  3. Create a nginx_install.yml playbook:
    ---
    - name: Install Nginx web server
      hosts: webservers
      become: yes # Run commands with sudo privilege
      tasks:
        - name: Update apt cache
          ansible.builtin.apt:
            update_cache: yes
        - name: Install Nginx package
          ansible.builtin.apt:
            name: nginx
            state: present
        - name: Ensure Nginx service is running and enabled
          ansible.builtin.systemd:
            name: nginx
            state: started
            enabled: yes
  4. Run the playbook: Execute ansible-playbook -i inventory.ini nginx_install.yml.

This playbook will connect to your web1 server, update its package cache, install Nginx, and ensure the service is running and configured to start on boot. This ensures every server provisioned gets the exact same Nginx setup, eliminating configuration drift.

Pro Tip: Integrate your Ansible playbooks into your CI/CD pipeline. After a new server is provisioned via Terraform, trigger an Ansible playbook to configure it automatically. This creates a fully automated “bare metal to running application” workflow.
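The glue between provisioning and configuration is often just an inventory file generated from the addresses your provisioning step reports. Here is a Python sketch, assuming you already have a mapping of host aliases to addresses (for example, parsed from Terraform outputs). The function name render_inventory is hypothetical, and the output mirrors the inventory.ini format shown earlier.

```python
def render_inventory(group: str, hosts: dict, user: str, key_file: str) -> str:
    """Render an INI-style Ansible inventory for one host group."""
    lines = [f"[{group}]"]
    # Sort aliases so the generated file is stable between runs.
    for alias, address in sorted(hosts.items()):
        lines.append(
            f"{alias} ansible_host={address} ansible_user={user} "
            f"ansible_ssh_private_key_file={key_file}"
        )
    return "\n".join(lines) + "\n"
```

A pipeline step could write the returned string to inventory.ini and then run ansible-playbook against it.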

Common Mistakes: Hardcoding sensitive information (passwords, API keys) directly in playbooks. Use Ansible Vault for encryption or retrieve secrets from a secure secrets manager.

9. Automate Performance Testing and Load Testing

You can’t scale what you don’t measure. Before pushing new features or significant changes to production, you absolutely must automate performance and load testing. This identifies bottlenecks and ensures your application can handle anticipated traffic spikes. I’ve seen too many launches fail because teams assumed their app could handle the load, only to crash and burn under the first wave of users.

Tools like Apache JMeter or k6 are excellent for this. We often use k6 because it allows us to write tests in JavaScript, making it accessible to more developers.

Here’s a simple k6 test script to simulate 100 concurrent users hitting your API:

  1. Install k6: Follow the instructions on the k6 website.
  2. Create a load_test.js file:
    import http from 'k6/http';
    import { sleep, check } from 'k6';
    
    export const options = {
      vus: 100, // 100 virtual users
      duration: '1m', // for 1 minute
      thresholds: {
        http_req_duration: ['p(95)<500'], // 95% of requests must complete within 500ms
        http_req_failed: ['rate<0.01'], // less than 1% of requests can fail
      },
    };
    
    export default function () {
      const res = http.get('https://your-api.example.com/data');
      check(res, {
        'is status 200': (r) => r.status === 200,
      });
      sleep(1); // Wait 1 second between requests
    }
    
  3. Run the test: Execute k6 run load_test.js.

This script simulates 100 users continuously hitting your /data endpoint for one minute. The thresholds are critical: they define your performance acceptance criteria. If 95% of requests don’t complete within 500ms, or if more than 1% of requests fail, the test will fail, indicating a performance regression or bottleneck.
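To see what those two thresholds mean numerically, here is a Python sketch that evaluates the same criteria over raw results. It uses a nearest-rank percentile for simplicity, so it may differ slightly from the percentile k6 reports, and the function names are hypothetical.

```python
import math


def percentile(values: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_thresholds(
    durations_ms: list,
    failed_flags: list,
    p95_limit_ms: float = 500,
    max_failure_rate: float = 0.01,
) -> dict:
    """Check the p(95)<500 and rate<0.01 criteria against collected results."""
    p95 = percentile(durations_ms, 95)
    failure_rate = sum(failed_flags) / len(failed_flags)
    return {
        "p95_ms": p95,
        "failure_rate": failure_rate,
        "passed": p95 < p95_limit_ms and failure_rate < max_failure_rate,
    }
```

Note how a small slow tail is tolerated: if 5 of 100 requests take 900 ms the p95 still passes, but at 10 of 100 it fails.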

Pro Tip: Integrate these load tests into your CI/CD pipeline, specifically in a staging environment. Make them a mandatory gate before deployment to production. If the load test fails, the deployment should be blocked. This prevents performance regressions from ever reaching your users.

Common Mistakes: Running load tests from your local machine, which can skew results due to network latency or local resource constraints. Use dedicated cloud-based load testing services or distributed k6 runners for more accurate simulations. Also, not testing against a production-like dataset, which can hide database performance issues.

10. Automate Incident Response and Self-Healing

The ultimate goal of automation isn’t just to prevent problems, but to automatically fix them when they occur. This is where incident response automation and self-healing systems come into play. Instead of waking up an engineer at 3 AM for a known issue, an automated runbook can often resolve it without human intervention.

This is a more advanced stage of automation, often leveraging a combination of the tools we’ve already discussed. For example, a Datadog alert could trigger an AWS Lambda function, which then executes an Ansible playbook or a Kubernetes command to restart a failing service or scale up a database.

Consider this hypothetical scenario: A specific microservice (e.g., payment-processor) starts reporting a high error rate in Datadog.

  1. Datadog Monitor: A monitor on payment-processor.errors.rate > 0.05 for 5 minutes.
  2. Alert Action: Instead of just sending a Slack message, configure the alert to trigger an AWS SNS topic.
  3. Lambda Function: An AWS Lambda function subscribes to the SNS topic. This function is written in Python (or Node.js) and contains logic to:
    • Identify the failing service (e.g., from the alert payload).
    • Connect to the Kubernetes cluster (for example, using the official Kubernetes client library for your language).
    • Trigger the equivalent of kubectl rollout restart deployment payment-processor through the Kubernetes API.
    • Send a notification to Slack/email indicating that an automated restart was attempted.

This creates a basic self-healing mechanism. If the service is merely in a bad state, a restart often resolves it. If it doesn’t, the subsequent alert (after the restart attempt fails to resolve the issue) can then escalate to a human. This significantly reduces mean time to recovery (MTTR) for common issues.
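The Lambda logic in this scenario can be sketched in Python as follows. Several things here are assumptions for illustration: the alert message is taken to be JSON with a service field, the default namespace is used, and the cluster call and Slack notification are injected as plain callables (patch_deployment and notify) so the sketch stays self-contained. With the Kubernetes Python client, patch_deployment could be AppsV1Api().patch_namespaced_deployment. The patch body mirrors what kubectl rollout restart does, which is to set a restartedAt annotation on the pod template.

```python
import json
from datetime import datetime, timezone

ALLOWED_SERVICES = {"payment-processor"}  # only restart services on this list


def restart_patch(now: datetime) -> dict:
    """Patch body that triggers a rolling restart by changing a pod-template annotation."""
    return {
        "spec": {"template": {"metadata": {"annotations": {
            "kubectl.kubernetes.io/restartedAt": now.isoformat()
        }}}}
    }


def handle_alert(event: dict, patch_deployment, notify, now=None) -> list:
    """Restart each allowed service named in the SNS records; return the names restarted."""
    now = now or datetime.now(timezone.utc)
    restarted = []
    for record in event.get("Records", []):
        alert = json.loads(record["Sns"]["Message"])  # hypothetical payload shape
        service = alert.get("service")
        if service not in ALLOWED_SERVICES:
            # Safeguard: unknown services are escalated, never restarted.
            notify(f"No automated action for {service!r}; escalating to on-call.")
            continue
        patch_deployment(name=service, namespace="default", body=restart_patch(now))
        notify(f"Automated restart attempted for {service}.")
        restarted.append(service)
    return restarted
```

The allow-list is the important design choice: it keeps the automation limited to well-understood, low-risk actions.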

Pro Tip: Start small with self-healing. Automate responses for well-understood, low-risk incidents first (e.g., restarting a stuck service, clearing a temporary cache). Avoid automating complex or irreversible actions without extensive testing and safeguards.

Common Mistakes: Over-automating without proper testing, leading to “runaway automation” that makes problems worse. Always include safeguards, rollbacks, and clear logging for automated actions. Another mistake is not documenting your automated runbooks; future engineers need to understand why and how things are being automatically fixed.

Implementing these automated workflows isn’t just about efficiency; it’s about building a resilient, scalable, and secure application that can grow with your user base. The journey from manual operations to full automation is continuous, but each step provides tangible benefits, freeing your team to focus on innovation rather than firefighting.

What is the single most important automation to implement first for a new application?

Without a doubt, establishing a Continuous Integration/Continuous Deployment (CI/CD) pipeline is the most critical first step. It ensures consistent code delivery, reduces manual errors, and forms the foundation for all other automation efforts.

How can small teams manage the complexity of so many automation tools?

Small teams should prioritize adopting managed services where possible (e.g., AWS RDS for databases, GitHub Actions for CI/CD, Datadog for observability). This offloads much of the operational burden. Start with a few core tools and gradually expand as your needs and expertise grow, rather than trying to implement everything at once.

Is Infrastructure as Code (IaC) really necessary for a small project?

Yes, absolutely. Even for small projects, IaC (like Terraform) ensures your infrastructure is documented, reproducible, and can be easily provisioned or torn down. It prevents “snowflake” servers and makes scaling up or even recreating environments trivial, which saves time and reduces errors in the long run.

What’s the difference between automated monitoring and automated alerting?

Automated monitoring is the continuous collection and visualization of metrics, logs, and traces from your application and infrastructure. Automated alerting is the proactive notification system that triggers when specific thresholds or anomalies are detected within that monitored data, requiring immediate attention or action.

How often should automated security scans be performed?

Automated security scans (SAST, DAST, dependency scans) should be performed at a minimum on every code commit or pull request in your CI/CD pipeline. For critical production environments, daily or even hourly scans of running systems or container images can be beneficial, especially for catching newly disclosed vulnerabilities in dependencies you have already shipped.

Cynthia Harris

Principal Software Architect
MS, Computer Science, Carnegie Mellon University

Cynthia Harris is a Principal Software Architect at Veridian Dynamics, boasting 15 years of experience in crafting scalable and resilient enterprise solutions. Her expertise lies in distributed systems architecture and microservices design. She previously led the development of the core banking platform at Ascent Financial, a system that now processes over a billion transactions annually. Cynthia is a frequent contributor to industry forums and the author of "Architecting for Resilience: A Microservices Playbook."