fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer #3880

olegsa wants to merge 6 commits into modelcontextprotocol:main
Conversation
…d refactor HTTP client usage
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and add tests verifying correct decoding of non-UTF-8 pages (Ukrainian windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
olaservo
left a comment
Nice fix — the `get_response_text()` helper is well-structured, with a conservative fallback chain (charset header → charset-normalizer detection → default). Good test coverage across multiple encodings.

One nit: `test_korean_euc_kr` actually encodes the text as UTF-8, not euc-kr. Consider renaming it to `test_utf8_without_charset_header`, or changing it to actually encode as euc-kr to match the test name.
Otherwise LGTM.
This review was assisted by Claude Code.
The test was encoding the Korean text as UTF-8, so it did not exercise the charset-normalizer detection path. Encode with euc-kr and extend the sample text so statistical detection has enough bytes to work with. Addresses review feedback on modelcontextprotocol#3880. Made-with: Cursor
Thanks for the review, @olaservo! Good catch on the euc-kr test — you're right, it was silently encoding as UTF-8. Fixed in 37da471: the test now actually encodes the sample text as euc-kr.
Summary
- `mcp-server-fetch` now uses `charset-normalizer` for pages that don't declare a charset in the HTTP `Content-Type` header
- Adds a `get_response_text()` helper that checks `response.charset_encoding` first, then falls back to statistical byte analysis via `charset-normalizer`

Motivation
Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP `Content-Type` header. The current code uses `response.text`, which defaults to UTF-8, producing garbled/mojibake output for these pages.

Changes
- `server.py`: Added `get_response_text()`, which uses `charset-normalizer` for encoding detection when HTTP headers lack charset info. Replaced `response.text` with `get_response_text(response)` in `fetch_url()`. Also moved `httpx` from local imports to a top-level import.
- `pyproject.toml`: Added `charset-normalizer>=3.0.0` as an explicit dependency (already a transitive dep via `requests`).
- `tests/test_server.py`: Added a `TestGetResponseText` class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.

Test plan
Made with Cursor