fix(fetch): detect encoding for non-UTF-8 pages using charset-normalizer #3880

olegsa wants to merge 6 commits into modelcontextprotocol:main
Conversation
…d refactor HTTP client usage
…tests

Add charset-normalizer as an explicit dependency in pyproject.toml and add tests verifying correct decoding of non-UTF-8 pages (Ukrainian windows-1251, Hebrew windows-1255, Arabic windows-1256, Korean euc-kr).

Made-with: Cursor
olaservo
left a comment
Nice fix — the `get_response_text()` helper is well-structured, with a conservative fallback chain (charset header → charset-normalizer detection → default). Good test coverage across multiple encodings.

One nit: `test_korean_euc_kr` actually encodes the text as UTF-8, not euc-kr. Consider renaming it to `test_utf8_without_charset_header`, or changing it to actually encode as euc-kr to match the test name.
Otherwise LGTM.
This review was assisted by Claude Code.
The test was encoding the Korean text as UTF-8, so it did not exercise the charset-normalizer detection path. Encode with euc-kr and extend the sample text so statistical detection has enough bytes to work with. Addresses review feedback on modelcontextprotocol#3880. Made-with: Cursor
Thanks for the review, @olaservo! Good catch on the euc-kr test — you're right, it was silently encoding as UTF-8. Fixed in 37da471: the test now actually encodes the sample text as euc-kr.
Summary
- `mcp-server-fetch` now uses `charset-normalizer` for pages that don't declare a charset in the HTTP `Content-Type` header
- Adds a `get_response_text()` helper that checks `response.charset_encoding` first, then falls back to statistical byte analysis via `charset-normalizer`

Motivation
Many websites (especially non-English ones) serve content in legacy encodings like windows-1255 (Hebrew), windows-1251 (Cyrillic), windows-1256 (Arabic), or euc-kr (Korean) without declaring the charset in the HTTP `Content-Type` header. The current code uses `response.text`, which defaults to UTF-8, producing garbled/mojibake output for these pages.

Changes
- `server.py`: Added `get_response_text()`, which uses `charset-normalizer` for encoding detection when HTTP headers lack charset info. Replaced `response.text` with `get_response_text(response)` in `fetch_url()`. Also moved `httpx` from local imports to a top-level import.
- `pyproject.toml`: Added `charset-normalizer>=3.0.0` as an explicit dependency (already a transitive dep via `requests`).
- `tests/test_server.py`: Added a `TestGetResponseText` class with 5 tests covering UTF-8 passthrough, Ukrainian (windows-1251), Hebrew (windows-1255), Arabic (windows-1256), and Korean (euc-kr) encoding detection.

Test plan
Made with Cursor