Site Reliability Engineer @ one2n.io

September 2025 - Present

DevOps and Cloud Infrastructure Consultant for an AI-based Fraud Detection and AML Monitoring Startup

Accelerated Provisioning: Automated regional and BYOC (Bring Your Own Cloud) environment deployments with Terraform and Pulumi, reducing turnaround time for new infrastructure from 1 week to 1 day.
Cost Optimization: Saved $4.8k/mo by tuning Kubernetes resource requests and limits to eliminate over-provisioning of GKE nodes.
Zero-Trust Security: Secured back-office portals by implementing Cloudflare Zero Trust.
Rapid Recovery: Reduced rollback time from 30 to 5 minutes by introducing optimized pipelines.
Multi-Cloud PaaS Architecture: Led the re-architecture of the GCP-based SaaS into a PaaS to support enterprise self-deployment across on-prem, AWS, GCP, and Azure.
Data Warehousing and Analytics: Led architecture planning for multi-tenancy and deployment of an end-to-end analytics platform built on the StarRocks data warehouse, Debezium, and KubeRay on GKE.
SOC 2 Compliance: Led infrastructure and development action items for SOC 2 Type 2 certification.

April 2025 - September 2025

Site Reliability Engineer for a Mid-sized Fintech (Payments Platform similar to Paytm)

Production & Infra Support: Held on-call and infrastructure ownership across 5 AWS environments for 45+ services during a monolith-to-microservices migration.
Alerting: Reduced PagerDuty alerts from 100/day to 4/day via signal and threshold tuning.
Reliability: Implemented Datadog Synthetics to protect critical SMS/OTP authentication flows.
Automation: Cut pre-deploy checks from 1 hour to 10 seconds using a Python CLI and GitHub Actions.
12-Factor Architecture: Refactored Ruby services with Shoryuken and Sidekiq on AWS SQS; isolated workers into separate containers for fault isolation and faster recovery.
IaC: Decoupled Terraform from CI/CD and introduced a custom PR review workflow for infrastructure changes.
Deployments: Rolled out Kubernetes canary releases for 27 high-traffic services.
Incident Management: Designed SLIs and SLOs to improve MTTD and MTTR.
Self-Healing: Automated Aurora Postgres lock remediation, eliminating manual intervention.