JAVA-5949 preserve connection pool on backpressure errors when establishing connections#1900
Conversation
|
Assigned |
|
Added the following to the description of the PR (a new piece of work to be done within the PR): DRIVERS-3394: adjust spec test description (#1894)(JAVA-6095). |
| // clear the pool as they're not related to overload. | ||
| // TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool | ||
| // as they indicate configuration issues, not server overload. | ||
| if (beforeHandshake && !sdamIssue.relatedToAuth() && !sdamIssue.relatedToTlsConfigurationError()) { |
There was a problem hiding this comment.
Currently we attach SystemOverloadedError and RetryableError labels in the SDAM error-handling path (effectively only for DefaultServer). In load-balanced mode, SDAM isn’t involved: the LB code path invalidates the pool directly (e.g., connectionPool.invalidate(serviceId, generation)), so the labeling logic is bypassed.
This means users running the driver in LB mode (behind an NLB) can still hit network errors, TLS handshake failures, timeouts during connection establishment or hello, but won’t get the labels.
However, these labels are a CMAP requirement, not SDAM. The CMAP spec states: “The pool MUST add the error labels SystemOverloadedError and RetryableError to network errors or network timeouts it encounters during the connection establishment or the hello message.”
Since this is defined as a pool behavior (topology-agnostic), it seems we should implement the labeling in the connection pool layer so it applies consistently in both default and load-balanced modes.
There was a problem hiding this comment.
Pull request overview
This PR updates SDAM/backpressure handling so that connection-establishment failures caused by server-side backpressure do not clear the connection pool, and adds/updates tests and matchers to align with the SDAM spec test expectations.
Changes:
- Adjust SDAM exception handling to avoid invalidating the pool on certain pre-handshake network failures, and add TLS-configuration-error classification.
- Add new error label constants (
SystemOverloadedError,RetryableError) and apply them for backpressure-related failures. - Extend unified event matching for additional server types and add a prose test for connection pool backpressure behavior.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| driver-sync/src/test/functional/com/mongodb/client/unified/EventMatcher.java | Extends server type matching for SDAM serverDescriptionChangedEvent expectations. |
| driver-sync/src/test/functional/com/mongodb/client/ServerDiscoveryAndMonitoringProseTests.java | Adds a prose test intended to validate connection pool backpressure behavior via server rate limiter parameters. |
| driver-core/src/test/unit/com/mongodb/internal/connection/DefaultServerSpecification.groovy | Updates unit tests to expect pool preservation on network errors during connection establishment (sync/async). |
| driver-core/src/main/com/mongodb/internal/connection/SdamServerDescriptionManager.java | Adds TLS configuration error detection helper on SDAM issues. |
| driver-core/src/main/com/mongodb/internal/connection/DefaultSdamServerDescriptionManager.java | Changes SDAM exception handling to preserve pool/server description for certain pre-handshake failures and apply labels. |
| driver-core/src/main/com/mongodb/MongoException.java | Introduces new public error label constants used by the backpressure handling path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /** | ||
| * An error label indicating that the server is overloaded. | ||
| * | ||
| * @see #hasErrorLabel(String) | ||
| * @since 5.7 | ||
| */ | ||
| public static final String SYSTEM_OVERLOADED_ERROR_LABEL = "SystemOverloadedError"; | ||
|
|
||
| /** | ||
| * An error label indicating that the operation is safely retryable. | ||
| * | ||
| * @see #hasErrorLabel(String) | ||
| * @since 5.7 | ||
| */ | ||
| public static final String RETRYABLE_ERROR_LABEL = "RetryableError"; | ||
|
|
There was a problem hiding this comment.
I introduced these constants in #1918. There is more to this than the change do in the current PR. Let's revert the change in the current PR to resolve the conflict with backpressure.
| // Backpressure spec: Don't clear pool or mark server unknown for connection establishment failures | ||
| // (network errors or timeouts during handshake). Authentication errors after handshake should still | ||
| // clear the pool as they're not related to overload. | ||
| // TLS configuration errors (certificate validation, protocol mismatches) should also clear the pool | ||
| // as they indicate configuration issues, not server overload. |
There was a problem hiding this comment.
We don't usually copy the specification prose into code as comments. I suspect, this comment was left by the AI.
Let's remove this comment, unless there is a good reason to leave it, which was not applicable to the rest of the code in this file that existed before the current PR.
| // Don't update server description to Unknown | ||
| // Don't invalidate the connection pool | ||
| // Apply error labels for backpressure |
There was a problem hiding this comment.
We don't usually copy the specification prose into code as comments. I suspect, this comment was left by the AI.
Let's remove this comment, unless there is a good reason to leave it, which was not applicable to the rest of the code in this file that existed before the current PR.
There was a problem hiding this comment.
[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]
The current PR seemingly depends on #1856 (see the description of the current PR). We need to decide what to do with that.
P.S. I originally left my thoughts in the description of this PR, but I don't know if that's what we are going to do.
There was a problem hiding this comment.
We no longer depend on this PR see #1900 (comment)
There was a problem hiding this comment.
[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]
- Could you please confirm that all the specification changes listed in the "Specification changes" part of the description of the current PR have been addressed in the PR?
- If they have been addressed, let's update the description correspondingly.
There was a problem hiding this comment.
Description updated. All specs are implemented
There was a problem hiding this comment.
[this comment is left on a file, but has nothing to do with the file; this is done so that we could reply to the comment; commenting on a PR does not allow replies - horrendous GitHub functionality]
JAVA-5949 has an old comment, which instructs to re-enable some tests that were previously disabled when we updated the testing/resources/specifications submodule in main. Those tests were not enabled. Let's enable them and make sure they pass.
There was a problem hiding this comment.
This PR no longer depends on #1856. Spec PR mongodb/specifications#1880 rewrote the backpressure-network-*-fail.yml tests to wait on serverDescriptionChangedEvent instead of serverHeartbeatSucceededEvent.
…establishing connections
- Fixing static check analysis
0be0250 to
bce97d4
Compare
…handshake network timeout
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
JAVA-6141
- Deprioritize sharded clusters on any error, all other topologies only on SystemOverloadedError. - Pass ClusterType to updateCandidate so onAttemptFailure can distinguish topology types. - Add retryable reads prose tests 3.1 and 3.2. - Change ServerSelectionSelectionTest to use BaseCluster server selection chain. JAVA-6105 JAVA-6021 JAVA-6074 --------- Co-authored-by: Valentin Kovalenko <valentin.male.kovalenko@gmail.com> Co-authored-by: Ross Lawley <ross.lawley@gmail.com>
The relevant spec changes: - https://github.com/mongodb/specifications/blob/8a8a7c56429c80b51ec62268dcafc5e5e3c477ef/source/client-backpressure/tests/README.md JAVA-5956, JAVA-6117, JAVA-6113, JAVA-6119, JAVA-6141
- Add enableOverloadRetargeting boolean option to MongoClientSettings and ConnectionString to allow the driver to route requests to a different replica set member on retries when the previously used server is overloaded - Add prose test 3.3 to verify that overload errors are retried on the same server when retargeting is disabled JAVA-6167 --------- Co-authored-by: Ross Lawley <ross.lawley@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ntMatcher.java Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| return false; | ||
| } | ||
|
|
||
| static boolean isTlsConfigurationError(final Throwable t) { |
There was a problem hiding this comment.
isDnsLookupFailure and isTlsConfigurationError take Throwable, but the applyLabelsIfEligible has already narrowed the exception to MongoSocketException. Since isDnsLookupFailure and isTlsConfigurationError are not used with non-MongoSocketException inputs on the production path, we could tighten both signatures to MongoSocketException to make the precondition explicit.
This would also let us simplify sdamIssueTlsCheckDelegatesToHelper to assert TLS classification via applyLabelsIfEligible (the actual contract), rather than calling the helper with inputs it never receives.
| static boolean isTlsConfigurationError(final Throwable t) { | |
| private static boolean isTlsConfigurationError(final MongoSocketException t) { |
| Throwable cause = t.getCause(); | ||
| while (cause != null) { | ||
| if (cause instanceof CertificateException | ||
| || cause instanceof CertPathBuilderException | ||
| || cause instanceof CertPathValidatorException | ||
| || cause instanceof SSLPeerUnverifiedException | ||
| || cause instanceof SSLProtocolException) { | ||
| return true; | ||
| } | ||
| if (cause instanceof SSLHandshakeException) { | ||
| String message = cause.getMessage(); | ||
| if (message != null) { | ||
| String lowerMessage = message.toLowerCase(Locale.ROOT); | ||
| if (lowerMessage.contains("certificate") | ||
| || lowerMessage.contains("verify") | ||
| || lowerMessage.contains("trust") | ||
| || lowerMessage.contains("hostname") | ||
| || lowerMessage.contains("protocol") | ||
| || lowerMessage.contains("cipher") | ||
| || lowerMessage.contains("handshake_failure")) { | ||
| return true; | ||
| } | ||
| } | ||
| } | ||
| cause = cause.getCause(); | ||
| } | ||
| return false; |
There was a problem hiding this comment.
I dug into isTlsConfigurationError and the TLS exception landscape across JDK versions and providers. Sharing what I found below and what we discussed over the call.
Current TLS config error gaps
The keyword list (certificate, verify, trust, hostname, protocol, cipher, handshake_failure) catches 12 of the 25 RFC-defined TLS alert descriptions but misses these config errors.
| Alert | JDK exception message | Labeled? |
|---|---|---|
unknown_ca(48) |
"Received fatal alert: unknown_ca" |
Yes -- false postive |
access_denied(49) |
"Received fatal alert: access_denied" |
Yes -- *false positive * |
insufficient_security(71) |
"Received fatal alert: insufficient_security" |
Yes -- false positive |
unrecognized_name(112) |
"Received fatal alert: unrecognized_name" |
Yes -- false positive |
BouncyCastle's JSSE provider is currently unhandled: it produces TlsFatalAlertReceived (extends IOException, not SSLException), however, there is no good way to catch those.
How FIN, RST, and TLS alerts surface across providers and transports
Full exception chains (42 test cases)
FIN (graceful close during TLS handshake):
| Transport | Provider | Exception chain |
|---|---|---|
| SocketStream | JDK | MongoSocketWriteException -> SSLHandshakeException("Remote host terminated the handshake") -> EOFException("SSL peer shut down incorrectly") |
| NettyStream | JDK | MongoSocketWriteException -> ClosedChannelException |
| TlsChannelStream | JDK | MongoSocketWriteException -> ClosedChannelException |
| SocketStream | BC | MongoSocketWriteException -> TlsFatalAlert("handshake_failure(40)") |
| NettyStream | BC | MongoSocketWriteException -> ClosedChannelException |
| TlsChannelStream | BC | MongoSocketWriteException -> ClosedChannelException |
RST (abrupt reset during TLS handshake):
| Transport | Provider | Exception chain |
|---|---|---|
| SocketStream | JDK | MongoSocketWriteException -> SocketException("Connection reset") |
| NettyStream | JDK | MongoSocketWriteException -> SslClosedEngineException("SSLEngine closed already") |
| TlsChannelStream | JDK | MongoSocketWriteException -> SocketException("Connection reset") |
| SocketStream | BC | MongoSocketWriteException -> SocketException("Connection reset") |
| NettyStream | BC | MongoSocketWriteException -> SslClosedEngineException("SSLEngine closed already") |
| TlsChannelStream | BC | MongoSocketWriteException -> SocketException("Connection reset") |
TLS alert example -- unknown_ca(48):
| Transport | Provider | Exception chain |
|---|---|---|
| SocketStream | JDK | MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca") |
| NettyStream | JDK | MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca") |
| TlsChannelStream | JDK | MongoSocketWriteException -> SSLHandshakeException("Received fatal alert: unknown_ca") |
| SocketStream | BC | MongoSocketWriteException -> TlsFatalAlertReceived("unknown_ca(48)") |
| NettyStream | BC | MongoSocketWriteException -> SSLException -> TlsFatalAlertReceived("unknown_ca(48)") |
| TlsChannelStream | BC | MongoSocketWriteException -> SSLException -> TlsFatalAlertReceived("unknown_ca(48)") |
Key observation: FIN/RST get labeled correctly today because they don't match any exclusion check . A TLS alert from the server means it actively responded, so it wasn't shedding load.
Assumption: A rate limiter drops the TCP connection (FIN/RST) before or during a TLS exchange.
Two approaches
Both are heuristic-based -- perfect classification isn't possible since exception types and messages are provider implementation details.
A. Improved blocklist (current approach): Label by default, exclude what we can identify as not-overload.
Failure mode: unrecognized config errors stay labeled (pool preserved, unnecessary retries).
it should be better for unknown providers - if we can't classify an error, pool is preserved, which is the safer default under load.
B. Allowlist: Only label recognized I/O patterns (EOFException, SocketException("Connection reset"), ClosedChannelException).
Failure mode: unrecognized I/O errors are not labeled (pool cleared unnecessarily). This is the pre-backpressure behavior. For example, BC represents FIN as TlsFatalAlert("handshake_failure(40)") - the allowlist misses it, pool gets cleared.
TL;DR / Suggestion:
As we discussed over the call, the blocklist (Approach A) matches the CMAP spec's direction ("the pool MUST add the error labels" as default, with exclusions carved out).
However, we could check for all RFC alert description strings, so we can improve the heuristic.
All 25 handshake-only TLS alert descriptions from OpenJDK's Alert.java (RFC 8446/5246)
These are the alert codes with handshakeOnly=true in OpenJDK. When received from a peer, they produce SSLHandshakeException("Received fatal alert: <description>"). All are definitively config/protocol errors, not I/O:
| Code | Description string | Currently caught by keywords? |
|---|---|---|
| 40 | handshake_failure |
Yes (handshake_failure) |
| 41 | no_certificate |
Yes (certificate) |
| 42 | bad_certificate |
Yes (certificate) |
| 43 | unsupported_certificate |
Yes (certificate) |
| 44 | certificate_revoked |
Yes (certificate) |
| 45 | certificate_expired |
Yes (certificate) |
| 46 | certificate_unknown |
Yes (certificate) |
| 47 | illegal_parameter |
No |
| 48 | unknown_ca |
No |
| 49 | access_denied |
No |
| 50 | decode_error |
No |
| 51 | decrypt_error |
No |
| 60 | export_restriction |
No |
| 70 | protocol_version |
Yes (protocol) |
| 71 | insufficient_security |
No |
| 100 | no_renegotiation |
No |
| 109 | missing_extension |
No |
| 110 | unsupported_extension |
No |
| 111 | certificate_unobtainable |
Yes (certificate) |
| 112 | unrecognized_name |
No |
| 113 | bad_certificate_status_response |
Yes (certificate) |
| 114 | bad_certificate_hash_value |
Yes (certificate) |
| 115 | unknown_psk_identity |
No |
| 116 | certificate_required |
Yes (certificate) |
| 120 | no_application_protocol |
No |
We should keep the existing type checks (CertificateException, CertPathBuilderException, CertPathValidatorException, SSLPeerUnverifiedException, SSLProtocolException) - they handle local validation failures correctly on both providers (and should on others).
P.S Claude was used to examine JSSE providers and create a table of RFC messages used in OpenJDK's provider.
| def.skipJira("https://jira.mongodb.org/browse/JAVA-5949") | ||
| .file("server-discovery-and-monitoring", "backpressure-network-timeout-error-replicaset"); | ||
| def.skipJira("https://jira.mongodb.org/browse/JAVA-5949") | ||
| def.skipJira("https://jira.mongodb.org/browse/JAVA-6174") |
There was a problem hiding this comment.
I suggest marking this as // TODO-BACKPRESSURE, since before the release we should re-enable the test together with the fix.
| serverMonitor.connect(); | ||
| } else if (sdamIssue.relatedToNetworkNotTimeout() | ||
| || (beforeHandshake && (sdamIssue.relatedToNetworkTimeout() || sdamIssue.relatedToAuth()))) { | ||
| if (sdamIssue.hasBackpressureLabel()) { |
There was a problem hiding this comment.
Let's name it isSystemOverloadedError to be precise as we do not have Backpressure label?
| if (sdamIssue.hasBackpressureLabel()) { | |
| if (sdamIssue.hasSystemOverloadedLabel()) { |
| BackpressureErrorLabeler.applyLabelsIfEligible(exceptionToThrow) | ||
| assert !exceptionToThrow.hasErrorLabel(MongoException.SYSTEM_OVERLOADED_ERROR_LABEL) |
There was a problem hiding this comment.
This part seems better suited for the unit tests where the SUT is BackpressureLabeler (e.g., BackpressureErrorLabelerTest). In this test the SUT is server.getConnection(), so it would be preferable to avoid asserting labeling internals here.
Could we instead assert the “label not applied” outcome at the behavior level in the then: section (e.g., on def e = thrown(MongoException)) and keep the unit tests in BackpressureLabeler responsible for the labeling implementation details?
|
|
||
| class BackpressureErrorLabelerTest { | ||
|
|
||
| private static final ServerAddress ADDRESS = new ServerAddress(); |
There was a problem hiding this comment.
I think it’s worth extending coverage here to include broaden TLS failure modes/exceptions:
- Local TLS configuration failures (client-side), and
- Handshake failures caused by server alerts, which surface as
SSLHandshakeExceptionwith OpenJDK’s JSSE provider.
This also looks like a good fit for a parameterized test to cover multiple exception types/cases without duplicating test logic.
Suggested change to be more specific: vbabanin@d78261e
| assumeTrue(serverVersionAtLeast(7, 0)); | ||
|
|
||
| TestConnectionPoolListener connectionPoolListener = new TestConnectionPoolListener(); | ||
|
|
||
| MongoClientSettings clientSettings = getMongoClientSettingsBuilder() | ||
| .applyToConnectionPoolSettings(builder -> builder | ||
| .maxConnecting(100) | ||
| .addConnectionPoolListener(connectionPoolListener)) | ||
| .build(); |
There was a problem hiding this comment.
The spec says:
If running against a sharded cluster, this test MUST be run against only a single host
I don’t think we’re enforcing that requirement here, should we adjust the setup to ensure the test targets a single host when sharded?
Original PR #1854 accidentally closed, and had no outstanding review comments.
This current PR depends on [JAVA-6033] ServerHeartbeatSucceededEvent is not fired for initial POLL monitoring #1856:Update: This PR no longer depends on [JAVA-6033] ServerHeartbeatSucceededEvent is not fired for initial POLL monitoring #1856. Spec PR DRIVERS-3367: Fix racy backpressure-network-* tests specifications#1880 rewrote the
backpressure-network-*-fail.ymltests to wait onserverDescriptionChangedEventinstead ofserverHeartbeatSucceededEvent.we should merge the latter intomain, then merge this PR inbackpressure, and leave the necessaryTODOcomments in #1918 related to the dependency on #1856.backpressurebranch, rebasebackpressureon top ofmainto include the changes from themain(see Mergemainintobackpressure#1917 (comment) for more context on this process).TODOcomments.Specification changes:
Note that there was a change in the description of one of the tests that belongs to this PR, but I closed the corresponding JAVA-6095 (see this comment): DRIVERS-3394: adjust spec test description (#1894).
JAVA-5949, JAVA-6056, JAVA-5664
AI usage and effectiveness: