Skip to content

PERF: Use simdutf for decoding utf16#526

Open
ffelixg wants to merge 2 commits intomicrosoft:mainfrom
ffelixg:simdutf
Open

PERF: Use simdutf for decoding utf16#526
ffelixg wants to merge 2 commits intomicrosoft:mainfrom
ffelixg:simdutf

Conversation

@ffelixg
Copy link
Copy Markdown
Contributor

@ffelixg ffelixg commented Apr 15, 2026

Work Item / Issue Reference

GitHub Issue: #514


Summary

  • Use external dependency simdutf to decode utf16 when decoding in c++
  • Use pybind11 support for std::u16string to decode utf16 when interacting with python
  • Remove SQLWCHARToWString, WideToUTF8 and Utf8ToWString, refactor call sites

Copilot AI review requested due to automatic review settings April 15, 2026 23:50
@ffelixg
Copy link
Copy Markdown
Contributor Author

ffelixg commented Apr 15, 2026

I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit.

This brings arrow performance on linux for nvarchars in line with what it should be.

If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace WStringToSQLWCHAR

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).

Changes:

  • Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
  • Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
  • Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
mssql_python/pybind/CMakeLists.txt Adds simdutf dependency resolution and links it into the ddbc_bindings module.
mssql_python/pybind/ddbc_bindings.h Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string.
mssql_python/pybind/ddbc_bindings.cpp Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf.
mssql_python/pybind/connection/connection.h Updates connection APIs/state to store connection strings as std::u16string.
mssql_python/pybind/connection/connection.cpp Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly.
mssql_python/pybind/connection/connection_pool.h Updates pooling key types and APIs to use std::u16string.
mssql_python/pybind/connection/connection_pool.cpp Implements pooling with std::u16string connection string keys.
mssql_python/pybind/unix_utils.h Removes the SQLWCHARToWString declaration.
mssql_python/pybind/unix_utils.cpp Removes the SQLWCHARToWString implementation.
Comments suppressed due to low confidence (1)

mssql_python/pybind/ddbc_bindings.cpp:583

  • In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.

Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.

                    std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
                        paramBuffers, param.cast<std::u16string>());
                    LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
                        "length=%zu characters, buffer=%zu bytes",
                        paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
                    dataPtr = sqlwcharBuffer->data();
                    bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
                    strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
                    *strLenOrIndPtr = SQL_NTS;

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread mssql_python/pybind/ddbc_bindings.h
@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

static_assert(sizeof(SQLWCHAR) == sizeof(char16_t), "SQLWCHAR must be 16-bit");

if (length > 0) {
std::memcpy(utf16.data(), value, length * sizeof(SQLWCHAR));
@github-actions
Copy link
Copy Markdown

📊 Code Coverage Report

🔥 Diff Coverage

92%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 6959 out of 26856
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (91.4%): Missing lines 1562,1637,1654
  • mssql_python/pybind/ddbc_bindings.h (92.3%): Missing lines 107-108

Summary

  • Total: 67 lines
  • Missing: 5 lines
  • Coverage: 92%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1558-1566

  1558     LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
  1559     ErrorInfo errorInfo;
  1560     if (retcode == SQL_INVALID_HANDLE) {
  1561         LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1562         errorInfo.ddbcErrorMsg = "Invalid handle!";
  1563         return errorInfo;
  1564     }
  1565     assert(handle != 0);
  1566     SQLHANDLE rawHandle = handle->get();

Lines 1633-1641

  1633     return records;
  1634 }
  1635 
  1636 // Wrap SQLExecDirect
! 1637 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
  1638     LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
  1639         "query_length=%zu chars",
  1640         (void*)StatementHandle->get(), Query.length());
  1641     if (!SQLExecDirect_ptr) {

Lines 1650-1658

  1650         SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
  1651                            (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
  1652     }
  1653 
! 1654     SQLWCHAR* queryPtr = const_cast<SQLWCHAR*>(reinterpretU16stringAsSqlWChar(Query));
  1655     SQLRETURN ret = SQLExecDirect_ptr(StatementHandle->get(), queryPtr, SQL_NTS);
  1656     if (!SQL_SUCCEEDED(ret)) {
  1657         LOG("SQLExecDirect: Query execution failed - SQLRETURN=%d", ret);
  1658     }

mssql_python/pybind/ddbc_bindings.h

  103 #endif
  104 
  105 inline std::string utf16LeToUtf8Alloc(std::u16string utf16) {
  106     if (utf16.empty()) {
! 107         return {};
! 108     }
  109 
  110     simdutf::result utf8Length =
  111         simdutf::utf8_length_from_utf16le_with_replacement(utf16.data(), utf16.size());
  112     std::string utf8(utf8Length.count, '\0');


📋 Files Needing Attention

📉 Files with overall lowest coverage (click to expand)
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%

🔗 Quick Links

⚙️ Build Summary 📋 Coverage Details

View Azure DevOps Build

Browse Full Coverage Report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants