Skip to content

fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer#3880

Open
olegsa wants to merge 6 commits intomodelcontextprotocol:mainfrom
olegsa:fix/fetch-encoding-detection
Open

fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer#3880
olegsa wants to merge 6 commits intomodelcontextprotocol:mainfrom
olegsa:fix/fetch-encoding-detection

Conversation

@olegsa
Copy link
Copy Markdown

@olegsa olegsa commented Apr 9, 2026

Summary

  • Add automatic character encoding detection to mcp-server-fetch using charset-normalizer for pages that don't declare charset in the HTTP Content-Type header
  • Introduces get_response_text() helper that checks response.charset_encoding first, then falls back to statistical byte analysis via charset-normalizer
  • Fixes garbled text when fetching pages served in non-UTF-8 encodings (e.g. windows-1251, windows-1255, windows-1256, euc-kr) without a charset declaration in the HTTP header

Motivation

Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP Content-Type header. The current code uses response.text which defaults to UTF-8, producing garbled/mojibake output for these pages.

Changes

  • server.py: Added get_response_text() that uses charset-normalizer for encoding detection when HTTP headers lack charset info. Replaced response.text with get_response_text(response) in fetch_url(). Also moved httpx from local imports to a top-level import.
  • pyproject.toml: Added charset-normalizer>=3.0.0 as an explicit dependency (already a transitive dep via requests).
  • tests/test_server.py: Added TestGetResponseText class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.

Test plan

  • All existing tests pass
  • New encoding detection tests pass for Ukrainian, Hebrew, Arabic, Korean
  • UTF-8 pages with charset in HTTP header still use the standard path
  • Non-HTML content (JSON, etc.) is unaffected

Made with Cursor

olegsa added 4 commits April 9, 2026 10:32
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and
add tests verifying correct decoding of non-UTF-8 pages (Ukrainian
windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
Copy link
Copy Markdown
Member

@olaservo olaservo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix — the get_response_text() helper is well-structured with a conservative fallback chain (charset header → charset-normalizer detection → default). Good test coverage across multiple encodings.

One nit: test_korean_euc_kr actually encodes the text as UTF-8, not euc-kr. Consider renaming it to test_utf8_without_charset_header, or changing it to actually encode as euc-kr to match the test name.

Otherwise LGTM.


This review was assisted by Claude Code.

olegsa added 2 commits April 20, 2026 20:10
The test was encoding the Korean text as UTF-8, so it did not exercise
the charset-normalizer detection path. Encode with euc-kr and extend
the sample text so statistical detection has enough bytes to work with.

Addresses review feedback on modelcontextprotocol#3880.

Made-with: Cursor
@olegsa
Copy link
Copy Markdown
Author

olegsa commented Apr 20, 2026

Thanks for the review, @olaservo! Good catch on the euc-kr test — you're right, it was silently encoding as UTF-8. Fixed in 37da471: the test now actually calls text.encode("euc-kr") and uses a longer sample so charset-normalizer's statistical detection has enough bytes to work with. This gives us genuine coverage of the detection path for Korean rather than a duplicate of test_utf8_passthrough. All 5 TestGetResponseText tests still pass locally.

@cliffhall cliffhall added bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch labels Apr 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working server-fetch Reference implementation for the Fetch MCP server - src/fetch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants