CSS: Fix inconsistent UTF-8 scrubbing in decode_string_or_url slow path#229
CSS: Fix inconsistent UTF-8 scrubbing in decode_string_or_url slow path#229
decode_string_or_url slow path#229Conversation
Agent-Logs-Url: https://github.com/WordPress/php-toolkit/sessions/c82f4270-56cb-4261-bee7-92e7971aba49 Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
decode_string_or_url slow path
Agent-Logs-Url: https://github.com/WordPress/php-toolkit/sessions/892f6cca-d936-43dc-9f63-27739478ddbd Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
Compared both patches — the code fix is identical in both. Replaced my 3 URL/string-token tests with the 2 cleaner ident-based tests from #231 (same structure as the existing invalid-UTF-8 tests, showing both Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
|
The two additional tests fail on trunk: I think that the test failures are poorly rendered because the invalid UTF-8 is replaced with |
|
@copilot Why doesn't the entire method just rely on the already scrubbed |
... Those failures are expected — they're the tests being run against the unfixed trunk branch, which is the intended regression-test behavior. In our branch the CI "PHPUnit Tests" run confirms all tests pass (commit 70fb724). The confusing visual diff is inherent: both the expected value (
The slow path works with absolute byte offsets into Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
| if ( $this->is_valid_escape( $at ) ) { | ||
| ++$at; | ||
| $decoded .= $this->decode_escape_at( $at, $bytes_consumed ); | ||
| $decoded .= wp_scrub_utf8( $this->decode_escape_at( $at, $bytes_consumed ) ); |
There was a problem hiding this comment.
Or should we call this in decode_escape_at()?
There was a problem hiding this comment.
@copilot This is a good question. Follow-up:
decode_escape_at()expects a valid UTF-8 sequence, doesn't it? Is there any handling for invalid utf-8 internally in that method?- Should the UTF-8 scrub move inside
decode_escape_at()?
Critically consider these questions and the tradeoffs, then make a change if necessary.
There was a problem hiding this comment.
Moved the fix into decode_escape_at() itself (commit 08fdb3a). Analysis:
decode_escape_at()already uses_wp_scan_utf8to detect the invalid subpart, so we know exactly when bytes are invalid — no need to scan again withwp_scrub_utf8()- In the invalid branch, we now
return "\u{FFFD}"directly, consistent with the EOF and null-byte branches which already returned U+FFFD - In the valid branch,
substr()already returns valid UTF-8 — no scrubbing needed - Removed the stale
wp_scrub_utf8()wrapper at the call site indecode_string_or_url, and updated the misleading "Later on the caller will replace..." comment - The 4 other callers that only use
$bytes_consumed(discarding the return value) are unaffected
adamziel
left a comment
There was a problem hiding this comment.
LGTM, just left one minor note
Agent-Logs-Url: https://github.com/WordPress/php-toolkit/sessions/693a68c5-780d-4b75-bdb2-61be871e88f5 Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
|
Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details. Warning Firewall rules blocked me from connecting to one or more addresses (expand for details)I tried to connect to the following addresses, but was blocked by firewall rules:
If you need me to access, download, or install something from one of these locations, you can either:
|
The slow path of
decode_string_or_urlpulled raw bytes from$this->csswithout scrubbing, while the fast path always appliedwp_scrub_utf8(). This meant a CSS value containing an invalid UTF-8 byte alongside a backslash escape could leak invalid bytes throughget_token_value().Two places were affected:
substr($this->css, $at, $normal_len)was appended directly; any invalid UTF-8 bytes in non-special-char runs were not replaced with U+FFFDdecode_escape_at()returned raw bytes in its "anything else" branch (the code comment there even noted replacement should happen in the caller, but never did)Fix:
wp_scrub_utf8(), consistent with the fast pathdecode_escape_at()itself: when_wp_scan_utf8detects an invalid subpart, the method now returns"\u{FFFD}"directly rather than returning raw bytes for the caller to scrub. All return paths ofdecode_escape_at()now consistently yield valid UTF-8Tests added:
test_invalid_utf8_in_normal_segment_combined_with_escape— ident token.test\xF1\\41name: invalid byte (0xF1) in the normal segment before a hex escape; slow path triggered by the backslash, but the preceding segment was not previously scrubbedtest_invalid_utf8_as_escaped_character— ident token.a\\\xF1b: invalid byte as the escaped character itself;decode_escape_at()now replaces it with U+FFFD internally