Unifying 17 Microservices Across 6 Teams: API Standardization at GAF Energy
The Stakes
GAF Energy's engineering org had 17 microservices spread across 6 product teams, and every one of them authenticated, formatted responses, and instrumented analytics differently. Developers switching teams lost an entire sprint just learning how a new service worked. PII was leaking through endpoints that lacked proper access controls. We were paying roughly $75K/year for three redundant Apigee gateway tenants that should have been one. We were scaling headcount but not scaling output, and every new hire made the entropy worse.
My Role
I was the Technical Product Manager responsible for designing and driving the API consolidation initiative from audit through adoption. I owned the problem space end-to-end: scoping the audit of all 17 services, facilitating architecture decisions with tech leads, and managing an 8-person engineering team through execution. The project touched every product team in the org (6 PMs, 6 tech leads, and roughly 40–50 engineers whose daily workflows would change).
This wasn't a greenfield build. We were operating on live systems with active feature development across every team, which meant I had to sequence the migration so we never blocked high-priority work. As in most orgs shipping at our pace, the standing 20% sprint allocation for tech debt was routinely consumed by feature work, so I had to build a business case compelling enough to protect a dedicated team for a full quarter.
The Real Problem
The surface-level read was "our APIs are messy." The deeper problem became clear through two channels. First, each quarter I facilitated SWOT sessions with all tech leads: in-person whiteboard sessions during quarterly on-sites, plus digital sessions with the broader engineering org where engineers nominated issues and voted on their top three per category. API inconsistency and authentication fragmentation weren't #1 in any single quarter, but they sat at #2 or #3 consistently. That persistence told me this was a systemic drag, not a one-off frustration. The voting also surfaced the specific engineers closest to the pain, who became my first collaborators when building the case for leadership.
Second, the audit itself confirmed the scale of the problem. My team and I documented every service, endpoint, authentication method, and data contract across all 17 microservices.
What we found:
We were running 3–4 different authentication methods across our services. Some teams used short-lived tokens that refreshed every 2 hours; one team had a manually rotated token on a 90-day cycle. There was no centralized client registry; teams provisioned access ad hoc with no audit trail. Our Apigee API Gateway, which should have been a single control plane, had fragmented into 3 separate tenants because different teams had spun up their own instances independently.
On the data side, our analytics events had no schema validation. Frontend and backend teams instrumented events using different naming conventions, field structures, and granularity levels. When those events hit our BigQuery warehouse, normalization was a manual, error-prone process, which meant our product and data teams were making decisions on unreliable data.
The insight that reframed the entire approach: this wasn't an API cleanup. It was a platform problem. We didn't need to fix 17 services; we needed to build the single standard that all 17 would converge on, and that standard had to be extensible enough to eventually expose to external partners through our upcoming Proposal API for roofing partners (our end users).
Approach & Trade-Offs
We evaluated four API gateway and auth solutions, plus an in-house build, against our core criteria: native GCP integration (our entire infrastructure lived in Google Cloud), compatibility with Salesforce (which served as a de facto backend for parts of our CRM workflow), support for our BigQuery and Cloud SQL data layer, and a licensing model that wouldn't penalize us for consolidating.
| Criteria | Apigee | Kong Enterprise | Okta | Firebase Auth | In-House |
|---|---|---|---|---|---|
| Native GCP integration | ✅ First-party | ⚠️ Requires config | ⚠️ Plugin-based | ✅ First-party | ✅ Full control |
| Salesforce compatibility | ✅ Supported | ⚠️ Custom work | ✅ Native connector | ❌ Not designed for this | ⚠️ Custom work |
| OAuth 2.0 client credentials | ✅ Full support | ✅ Full support | ✅ Full support | ⚠️ Limited | ✅ Full control |
| Rate limiting / throttling | ✅ Built-in | ✅ Built-in | ❌ Not a gateway | ❌ Not a gateway | ⚠️ Must build |
| Operational overhead | Low (managed) | High (self-hosted) | Low (managed) | Low (managed) | Very high |
| External partner API readiness | ✅ Strong | ✅ Strong | ⚠️ Auth-only | ❌ Not suited | ⚠️ Must build |
| Already partially in use | ✅ Yes | ❌ No | ❌ No | ❌ No | N/A |
Okta solved only the auth layer without giving us gateway, rate limiting, or request routing, so it would have required pairing with a separate solution. Firebase Auth was designed for consumer-facing apps, not service-to-service credential flows. Kong Enterprise had strong gateway features but required self-hosted infrastructure management our DevOps team couldn't absorb. Building in-house was rejected on headcount: the maintenance burden would have been a permanent tax on our small platform team.
We chose Apigee, consolidated to a single tenant, and implemented a unified OAuth 2.0 client credential flow across all services. The deciding factor beyond GCP integration was forward compatibility. The architecture we built treated every consumer of the API the same way: each internal microservice was simply a client with its own credentials, permissions, and rate limits. When we later shipped the Proposal API for roofing partners, enabling their CRMs to submit leads and receive solar designs and qualification statuses programmatically, onboarding them meant nothing more than adding a new client type. Roofing partners got their own API keys with a different permission set and rate limit tier, but the underlying patterns were identical to what our internal services used. This also gave us a security win: if any single service (internal or external) went haywire, we could revoke its token and cut access to internal systems within seconds without affecting anything else.
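The "everything is a client" model above can be sketched as a minimal registry. All names, scopes, and rate limits here are illustrative assumptions, not GAF Energy's actual configuration; in production this lived behind Apigee's API products rather than application code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClientType:
    """A tier of API consumer: internal service or external partner."""
    name: str
    scopes: frozenset       # permission set granted to this client type
    rate_limit_per_min: int # illustrative rate limit tier

# Hypothetical tiers -- internal services get broad scopes and high limits,
# roofing partners get a narrower permission set and a lower tier.
INTERNAL_SERVICE = ClientType("internal-service",
                              frozenset({"leads:read", "leads:write", "designs:read"}), 600)
ROOFING_PARTNER = ClientType("roofing-partner",
                             frozenset({"leads:write", "designs:read", "status:read"}), 60)

@dataclass
class Client:
    client_id: str
    client_type: ClientType
    revoked: bool = False

class ClientRegistry:
    """Central registry: every consumer, internal or external, is just a client."""
    def __init__(self):
        self._clients: dict[str, Client] = {}

    def register(self, client_id: str, client_type: ClientType) -> Client:
        client = Client(client_id, client_type)
        self._clients[client_id] = client
        return client

    def revoke(self, client_id: str) -> None:
        # Cutting off a misbehaving service is a single flag flip,
        # with no blast radius beyond that one client.
        self._clients[client_id].revoked = True

    def authorize(self, client_id: str, scope: str) -> bool:
        client = self._clients.get(client_id)
        if client is None or client.revoked:
            return False
        return scope in client.client_type.scopes
```

Adding the external Proposal API consumers under this model is just another `register()` call with a different `ClientType`, which is the forward-compatibility argument in miniature.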
For the response payload problem, we introduced GraphQL on high-traffic frontend-facing endpoints where REST responses were returning excessive data. A concrete example: GAF Energy's data model was originally built around homeowners, so entities in Salesforce were structured around addresses with every historical resident nested under them. After the business pivoted to serve roofers, we rebuilt around leads and roofer accounts, but the legacy data structure persisted in many endpoints. When the roofer portal loaded active leads, the REST endpoint returned every person who had ever lived at each address, not just the current lead. The frontend wasn't surfacing any of it, just waiting for it to load. GraphQL eliminated that bloat entirely. For service-to-service communication, we kept REST with strictly versioned contracts and validated schemas.
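The over-fetching fix can be illustrated with a toy version of what a GraphQL resolver does: trim a legacy address-centric payload down to the shape the query actually asked for. The field names and record shapes below are hypothetical stand-ins for the real Salesforce-derived contracts.

```python
def select_fields(payload: dict, selection: dict) -> dict:
    """Return only the fields named in `selection` (a nested dict of
    field -> sub-selection), mimicking how GraphQL trims a response
    to the query shape instead of returning the whole REST payload."""
    result = {}
    for field, sub in selection.items():
        if field not in payload:
            continue
        value = payload[field]
        if sub and isinstance(value, dict):
            result[field] = select_fields(value, sub)
        elif sub and isinstance(value, list):
            result[field] = [select_fields(item, sub)
                             for item in value if isinstance(item, dict)]
        else:
            result[field] = value
    return result

# Legacy address-centric record: every historical resident nested under
# the address, even though the roofer portal only needs the current lead.
legacy = {
    "address": "123 Main St",
    "residents": [  # the over-fetch the frontend never surfaced
        {"name": "A. Prior", "moved_out": "2019"},
        {"name": "B. Current", "moved_out": None},
    ],
    "current_lead": {"name": "B. Current", "status": "qualified",
                     "internal_notes": "not for the portal"},
}

# The "query": only the fields the portal view renders.
query = {"address": {}, "current_lead": {"name": {}, "status": {}}}
```

Running `select_fields(legacy, query)` drops the `residents` list and the internal-only fields entirely, which is the payload reduction reflected in the results table.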
For analytics, we implemented centralized event schema validation at the gateway level. Every event now conformed to a defined contract before reaching the pipeline, which meant our BigQuery warehouse received normalized, trustworthy data for the first time. The next phase we scoped (but didn't get to prioritize) would have introduced pre-commit hooks and CI-level schema validation so developers were notified of contract violations before code reached staging, rather than discovering errors post-deployment.
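A minimal sketch of what gateway-level contract enforcement looks like, assuming a simple field-name-to-type schema (the event names and fields here are invented for illustration; the real contracts were richer than type checks):

```python
# Hypothetical event contract: field name -> required Python type.
LEAD_CREATED_SCHEMA = {
    "event_name": str,
    "lead_id": str,
    "source": str,
    "timestamp_ms": int,
}

def validate_event(event: dict, schema: dict) -> list:
    """Return a list of contract violations; an empty list means the
    event may pass through to the analytics pipeline."""
    errors = []
    for field, expected_type in schema.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], expected_type):
            errors.append(f"wrong type for {field}: expected {expected_type.__name__}")
    for field in event:
        if field not in schema:
            # Reject undeclared fields so BigQuery receives a stable,
            # normalized shape instead of ad hoc naming conventions.
            errors.append(f"unexpected field: {field}")
    return errors
```

The scoped-but-unprioritized next phase would have run the same validation in pre-commit hooks and CI, so a failing `validate_event` surfaced at review time instead of post-deployment.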
What we deliberately cut
Early in scoping, I mapped every requirement and identified the top 80% of impact we had to deliver so that when we needed to cut scope (and we always do), I wasn't losing sleep over the bottom 20%. The most notable cut was a dedicated health check endpoint. We wanted a lightweight endpoint that DevOps and eventually external roofer integrations could ping before sending bulk requests. Engineering leadership argued you could ping any existing endpoint and infer health from a valid response. They weren't wrong, but a purpose-built health endpoint would have been lighter on the system and would have prevented roofers from firing 50 async requests at a downed service. Pragmatic trade-off: useful but not critical to the core migration. We also deprioritized pre-production schema validation monitoring. We believed (correctly) that gateway-level validation would catch most issues, but we underestimated how much friction the lack of automated pre-prod feedback would create for developers after launch.
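For context on how small the cut feature was, the scoped health endpoint amounted to roughly this: aggregate a handful of dependency probes into one cheap status response a partner can check before firing a bulk request. The dependency names and shapes are assumptions for illustration; we never built this.

```python
def health_check(dependencies: dict) -> tuple:
    """Aggregate dependency probes into one lightweight status response.
    `dependencies` maps a name to a zero-argument callable that returns
    True when healthy; any exception counts as unhealthy."""
    statuses = {}
    for name, probe in dependencies.items():
        try:
            statuses[name] = bool(probe())
        except Exception:
            statuses[name] = False
    status_code = 200 if all(statuses.values()) else 503
    body = {"status": "ok" if status_code == 200 else "degraded",
            "dependencies": statuses}
    return status_code, body
```

A roofer integration polling this before a batch submit would see a 503 and back off, instead of firing 50 async requests at a downed service.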
Execution
The project ran across 6 two-week sprints: 4 planned for core execution, 2 additional for hypercare and adoption. Here's where the real coordination happened:
Parallel migration without blocking feature work
For teams mid-sprint on high-priority features, my team built bridge layers, maintaining their existing API contracts while standing up the new standardized versions underneath. Once their feature shipped, we migrated them to the new auth and response patterns. For teams on lower-impact work, we coordinated with their PMs to pause and adopt early. This required me to map every team's roadmap, identify safe migration windows, and negotiate sequencing with 6 product managers and their tech leads.
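A bridge layer in this sense is just an adapter: serve the old contract from the new standardized service so unmigrated consumers keep working mid-migration. The payload shapes and field names below are hypothetical, not the actual GAF Energy contracts.

```python
def bridge_legacy_contract(new_payload: dict) -> dict:
    """Translate the new lead-centric response back into the legacy
    address-centric shape, so teams mid-sprint on feature work see no
    contract change until their safe migration window opens."""
    return {
        "address": new_payload["property"]["address"],
        # The legacy contract expected a residents list; expose only the
        # current lead in that old shape.
        "residents": [{"name": new_payload["lead"]["name"]}],
        "status": new_payload["lead"]["status"],
    }
```

Once a team shipped its feature and migrated, its consumers called the new contract directly and the bridge for that service was deleted, so the adapters were scaffolding rather than a permanent translation layer.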
The builder contract that changed the plan
About a month into execution, GAF Energy landed a major contract with a national homebuilder in the western U.S., representing roughly 40% of our annual revenue. Builders have plan sets but no satellite imagery, which means a completely different design and sales workflow than existing-home roofers. They needed reliable API access immediately, with patterns we could commit to, not contracts that would change a month later. So we had to finalize our client credential architecture earlier than planned and stand up a stable subset of the new API specifically for them, ahead of the broader rollout. It compressed our timeline but also validated our approach: the fact that we could onboard an external partner mid-migration using the same patterns we were building for internal services proved the platform design was sound.
The deprecation clock
After the core migration was complete, I set a deprecation deadline for the legacy API endpoints. I intentionally reviewed every major project's expected delivery date to make sure the cutoff didn't collide with critical launches. I coordinated with product leadership and tech leads across all 6 teams to align on the date, and privately built in a 2-week buffer because I knew adoption would lag. It did: the real adoption push happened in that buffer window, which was exactly the plan.
Documentation and training as a product, not an afterthought
I wrote comprehensive API documentation and recorded walkthrough videos showing developers how to authenticate against the new unified gateway, manage client credentials, and request new endpoints. Then I ran hands-on working sessions with each team alongside our tech lead. These weren't presentations; they were actual pairing sessions where developers migrated a real endpoint live. This wasn't optional; it was the adoption mechanism.
Stragglers and broken dashboards
Even after deprecation, we discovered a handful of internal dashboards and scripts quietly hitting legacy endpoints, things no one had documented or mentioned. I personally triaged each one, working with the affected stakeholders to migrate them. It wasn't glamorous, but it was the difference between "done" and actually done.
Results
| Metric | Before | After | Timeframe |
|---|---|---|---|
| Authentication methods in use | 3–4 | 1 (unified OAuth 2.0) | At deprecation deadline |
| Apigee gateway tenants | 3 | 1 | Sprint 3 |
| Apigee licensing cost | ~$75K/year | ~$25K/year | Immediate |
| Developer onboarding time (new team) | ~1 full sprint (2 weeks) | < 2 days | Post-adoption |
| Frontend API response payload | 60–70% unused data | Only requested fields | Post-migration |
| Roofer portal page load time | ~4.2s | ~1.8s | Post-migration |
| API-related escalations | ~8–12/month | ~2–3/month | 8 weeks post-launch |
| Event schema compliance | Inconsistent | 100% validated at gateway | Post-migration |
| Teams fully adopted | 0/6 | 6/6 | Sprint 6 |
The harder-to-quantify win was cultural. After this project, any mid-level or senior engineer could switch teams and immediately work with the API layer. No ramp sprint, no guesswork about auth patterns, no hunting for tribal knowledge. The API became predictable. That predictability compounded into faster feature delivery across every team for every sprint that followed.
One lasting outcome: the project convinced me to formalize the feedback loop. I launched quarterly developer experience surveys across all 6 teams so we'd have a consistent, measurable lens into platform friction going forward, rather than waiting for problems to surface through SWOT sessions or incident reports.
What I'd Do Differently
I underestimated the documentation timeline. The training sessions and docs should have started in Sprint 2, not Sprint 4. By the time we were ready for adoption, developers were already frustrated by the change. Having polished docs and videos available during migration rather than after would have cut resistance significantly and probably saved us that 5th sprint.
I also would have run the PII data classification work in parallel rather than treating it as a separate follow-on project. The API audit surfaced the PII exposure issues early, and we had the context and momentum to address both simultaneously. Sequencing them apart meant re-auditing endpoints we'd already touched.
The lesson I carry forward: when you're doing platform work, ship the developer experience alongside the infrastructure, not after it. Engineers adopt what feels easy, not what's architecturally correct.
So What
This project is how I work. Start with a clear goal. Deploy every tool in the belt when it's the right one for the job: data to diagnose, architecture to solve, documentation to scale, presence to drive adoption. Stay in the room until it's real. Then measure honestly, figure out what actually worked, and identify what needs to change. This one produced a unified platform, a faster engineering org, and a feedback loop to catch the next problem earlier. Even the parts we didn't finish became scoped proposals ready for the next quarter, because even incomplete work should produce key learnings, not just regret.