Why Your ESP Dashboard Lies To You (And What To Do About It)

I keep a spreadsheet I call "The Lie Document."

It's a simple table comparing how different ESPs define the same metrics. "Delivered." "Bounced." "Opened." "Engaged." Fifteen columns, eight major ESPs, and not a single row where everyone agrees.

Last year, I showed this spreadsheet to a client who was trying to compare performance between SendGrid and Mailgun. They'd been running the same campaign through both platforms, splitting their list 50/50, expecting to see which ESP performed better.

"SendGrid shows 97% delivery and Mailgun shows 89%," they said. "So SendGrid wins, right?"

Not quite. SendGrid was counting "accepted by the first hop" as delivered. Mailgun was counting "accepted by the final destination MTA." Same campaign. Different definitions. An 8-point gap that didn't exist.

They'd been about to move their entire program to SendGrid based on a definitional discrepancy.


"Delivered" Doesn't Mean What You Think

Here's the first thing I wish every email sender understood: when your ESP says an email was "delivered," they almost certainly don't mean it reached the inbox.

The term "delivered" in email can mean at least five different things:

1. Accepted by any MTA in the chain. The email left our servers and something accepted it. Could be a spam filter, a relay, a security gateway. The recipient may never see it.

2. Accepted by the recipient's MX. We got a 250 response from the receiving mail server. This is better, but still doesn't mean inbox. The receiving server might still filter it to spam, quarantine it, or drop it silently.

3. Not bounced. If we didn't get an explicit rejection, we call it delivered. Deferrals that eventually timed out? Delivered. Silent drops by spam filters? Delivered.

4. Successfully placed. The email actually landed somewhere the recipient can see it—inbox, spam folder, promotions tab. Very few platforms actually confirm this.

5. Reached the inbox. Almost no platform actually knows this without seed testing or pixel tracking, and even then it's an estimate.

Most ESPs use definition #2 or #3. But they present it as if it were #5. The dashboard shows "Delivered: 98%," and marketers assume 98% of their emails reached inboxes. They didn't.

The gap between "accepted by the MTA" and "reached the inbox" is typically 10-20 percentage points. For senders with reputation problems, it can be 40 points or more. That's a lot of emails your "delivery rate" is lying about.
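To make that gap concrete, here's a quick back-of-the-envelope sketch in Python. The acceptance rate and the filtering rates are assumptions I've plugged in for illustration, not measurements from any particular ESP:

```python
# Illustrative only: how the same send looks under different "delivered" definitions.
# All rates below are assumed example values, not data from any specific ESP.

sent = 100_000
accepted_by_mx = 0.98     # what most dashboards report as "delivered" (definition #2)
silently_filtered = 0.12  # assumed share of accepted mail dropped before any visible folder
landed_in_spam = 0.05     # assumed share of accepted mail placed in spam/junk

accepted = sent * accepted_by_mx
reached_a_folder = accepted * (1 - silently_filtered)
reached_inbox = accepted * (1 - silently_filtered - landed_in_spam)

print(f"Dashboard 'delivered':    {accepted / sent:.0%}")          # 98%
print(f"Placed somewhere visible: {reached_a_folder / sent:.0%}")  # ~86%
print(f"Estimated inbox:          {reached_inbox / sent:.0%}")     # ~81%
```

Same send, three numbers, and only the first one ever shows up on a dashboard.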


The Bounce Classification Problem

Bounces should be simple. Hard bounce means the address doesn't exist. Soft bounce means temporary failure. Right?

I exported bounce data from three ESPs for the same sending domain over the same week. Here's what I found:

ESP A: 2.1% hard bounces, 0.8% soft bounces
ESP B: 1.3% hard bounces, 1.9% soft bounces
ESP C: 3.4% hard bounces, 0.2% soft bounces

Same emails. Same recipients. Same week. Three completely different stories.

The discrepancy comes down to classification logic. Some ESPs classify "mailbox full" as a soft bounce, since it's technically a temporary failure. Others count it as hard, reasoning that a mailbox that's been full for seven days probably belongs to an abandoned account. Some classify a 5xx spam block as a hard bounce. Others correctly identify it as a reputation-based rejection, which is a deliverability problem, not a list problem.

I've seen ESPs classify the same error code three different ways:

Error                  ESP A         ESP B         ESP C
550 User unknown       Hard bounce   Hard bounce   Hard bounce
550 Blocked for spam   Hard bounce   Soft bounce   Block (separate category)
452 Mailbox full       Soft bounce   Soft bounce   Hard bounce
421 Rate limited       Soft bounce   Deferred      Not counted

Try building a consistent list hygiene strategy when your platforms can't agree on what a bounce means.
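If you want to see how fast this diverges, here's a small sketch of three rule sets modelled on the table above. The dictionary of classifiers is mine, written for illustration; it's not pulled from any ESP's actual code:

```python
# Sketch: the same SMTP responses classified under three hypothetical rule sets
# modelled on the table above. None of this is real ESP logic.

CLASSIFIERS = {
    "ESP A": {"550-user-unknown": "hard", "550-spam-block": "hard",
              "452-mailbox-full": "soft", "421-rate-limited": "soft"},
    "ESP B": {"550-user-unknown": "hard", "550-spam-block": "soft",
              "452-mailbox-full": "soft", "421-rate-limited": "deferred"},
    "ESP C": {"550-user-unknown": "hard", "550-spam-block": "block",
              "452-mailbox-full": "hard", "421-rate-limited": None},  # not counted at all
}

def classify(esp: str, smtp_event: str):
    """Return the bounce category this rule set would assign (None = not counted)."""
    return CLASSIFIERS[esp].get(smtp_event)

for esp in CLASSIFIERS:
    print(esp, classify(esp, "550-spam-block"))
# ESP A hard / ESP B soft / ESP C block -- one rejection, three different "bounce rates"
```

Feed identical traffic through all three and your hard bounce rate moves by whole percentage points without a single recipient changing.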


Engagement Metrics Are Even Worse

At least with bounces, you can dig into the raw error codes and make your own determination. Engagement metrics are black boxes.

Open rates: We've already established that Apple MPP breaks these, but even before that, ESPs disagreed on what counted as an "open." One pixel load or multiple? What if the email client caches the image? What about text-only clients that never load images?

Click rates: Are they unique clicks or total clicks? If someone clicks the same link three times, is that one click or three? What about bot clicks from security scanners—does your ESP filter those?

Click-to-open rate: Depends entirely on how opens are counted. If your open denominator is inflated by bots, your CTOR is meaningless.
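To show how much those definitional choices move the numbers, here's a sketch that computes click-to-open rate two ways from the same toy event log. The event format and the scanner heuristic are assumptions I've made up for the example, not anyone's real tracking pipeline:

```python
# Toy event log: (recipient, event_type, user_agent). All values are made up.
events = [
    ("a@example.com", "open",  "Mozilla/5.0"),
    ("a@example.com", "click", "SecurityScanner/1.0"),  # security scanner, not a human
    ("a@example.com", "click", "Mozilla/5.0"),
    ("a@example.com", "click", "Mozilla/5.0"),          # same person clicking again
    ("b@example.com", "open",  "GoogleImageProxy"),     # proxy-triggered open
]

def naive_ctor(evts):
    # Total clicks over total opens, bots and repeat clicks included.
    opens = sum(1 for _, kind, _ in evts if kind == "open")
    clicks = sum(1 for _, kind, _ in evts if kind == "click")
    return clicks / opens

def stricter_ctor(evts):
    # Unique recipients only, and drop clicks that look like security scanners.
    openers = {rcpt for rcpt, kind, _ in evts if kind == "open"}
    clickers = {rcpt for rcpt, kind, ua in evts
                if kind == "click" and "scanner" not in ua.lower()}
    return len(clickers) / len(openers)

print(f"naive CTOR:    {naive_ctor(events):.0%}")     # 150% -- total clicks / total opens
print(f"stricter CTOR: {stricter_ctor(events):.0%}")  # 50% -- unique humans only
```

Same five events, and the answer is either 150% or 50% depending on choices your ESP never tells you about.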

And here's the fun part: every ESP calculates these slightly differently, but none of them document exactly how. The "open rate" in SendGrid is not the same calculation as the "open rate" in Mailchimp, but both are presented as if they're the universal definition.

Good luck benchmarking your performance against "industry averages."


The Export Problem

"Just export the data and normalize it yourself."

I've heard this advice a hundred times. Here's why it doesn't work.

Each ESP exports different fields, in different formats, with different granularity. One gives you event-level data with microsecond timestamps. Another gives you daily aggregates. A third gives you recipient-level summaries but not raw events.

To normalize this data, you'd need to:

  1. Map fields from each ESP to a common schema (each ESP uses different terminology; see the sketch after this list)
  2. Handle missing data (one ESP tracks link clicks; another doesn't)
  3. Align timestamps across timezones (not all ESPs use UTC)
  4. Deduplicate events (some ESPs log retries as separate events)
  5. Reconcile recipient identifiers (email address formats, case sensitivity, encoding)
  6. Do this continuously, because your email program doesn't stop while you're building normalization logic
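To give you a feel for step 1 alone, here's a minimal sketch that maps two differently shaped exports onto one schema. Every field name here is invented for illustration; real export formats are messier than this:

```python
from datetime import datetime, timezone

# Target schema: (recipient, event, timestamp in UTC). The raw field names below
# are invented for this example and don't match any real ESP's export format.

def from_esp_a(row: dict) -> tuple:
    # "ESP A": event-level export, ISO timestamps with an offset, mixed-case addresses.
    ts = datetime.fromisoformat(row["event_time"]).astimezone(timezone.utc)
    return (row["rcpt"].strip().lower(), row["event_type"], ts)

def from_esp_b(row: dict) -> tuple:
    # "ESP B": different field names, epoch seconds, its own event vocabulary.
    event = {"delivered": "delivery", "bounced": "bounce"}[row["type"]]
    ts = datetime.fromtimestamp(row["ts"], tz=timezone.utc)
    return (row["recipient_email"], event, ts)

raw_a = {"rcpt": "User@Example.com", "event_type": "bounce",
         "event_time": "2024-05-01T09:00:00+02:00"}
raw_b = {"recipient_email": "user@example.com", "type": "bounced", "ts": 1714546800}

# Both rows describe the same bounce; after normalization they collapse to one event.
normalized = {from_esp_a(raw_a), from_esp_b(raw_b)}
print(normalized)  # one tuple, not two
```

And that's the easy part: two well-behaved rows, no missing fields, no retries logged twice, no fourth ESP added halfway through the project.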

I've watched enterprise clients spend six months building data pipelines to compare performance across three ESPs. By the time the pipeline was working, they'd added a fourth ESP, and the cycle started over.


Why This Matters

This isn't academic. Bad data leads to bad decisions.

When you can't compare performance across ESPs, you can't answer basic questions:

  • Which ESP actually delivers better for our Gmail audience?
  • Are the bounce rates on our new platform really higher, or just classified differently?
  • Why does our engagement look 15% better in one dashboard than another?

And you definitely can't answer strategic questions:

  • Should we consolidate to one ESP or is the current split working?
  • Which platform should handle our transactional emails?
  • Where are our deliverability risks concentrated?

I've seen clients make the wrong call on ESP migrations because they were comparing apples to oranges. I've seen teams get blamed for performance drops that were actually reclassification changes in the reporting layer. I've seen executives lose faith in email metrics entirely because the numbers never match.

The dashboards aren't intentionally lying. But they're not telling the truth either.


The Path Forward

Option one: accept that your metrics are inconsistent and make decisions based on vibes and vendor relationships. This is what most companies do, whether they admit it or not.

Option two: invest significant engineering resources in building and maintaining a normalization layer. This works if you have the team for it and are willing to treat email analytics as a core infrastructure problem.

Option three: use a platform that does the normalization for you.

This is why we built Engagor's unified data layer. Every ESP event gets normalized to a common schema the moment it arrives. A "bounce" means the same thing regardless of where it originated. A "delivery" means actual inbox placement confidence, not just "accepted by some server."

When a client asks "which ESP has better Gmail performance," they get a real answer. Not "well, SendGrid says X and Mailgun says Y, but they're counting different things."

I spent 18 years doing email consulting before building Engagor. At least a third of that time was spent normalizing data between platforms so I could give clients accurate advice. That's not a good use of anyone's time.


Your ESP dashboards show you what they want you to see. Engagor shows you what's actually happening.

See how it works →

About the author

Bram Van Daele

Founder & CEO

Bram has been working in email deliverability since 1998. He founded Teneo in 2007, which has become Europe's leading email deliverability consultancy. Engagor represents 27 years of hands-on expertise encoded into software.

Connect on LinkedIn →