Conversation
|
I have been evaluating the external library simdutf as a high performance replacement for utf16 -> utf8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than only using it for the arrow fetch path, I have been trying to make the switch for every location where one of these three functions is used, as the applications follow similar patterns. I didn't use simdutf in every case, another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from python. I think this avoids the whole issue where wchars are defined as 32 bit on some OSes but SQLWCHARs are always 16 bit. This brings arrow performance on linux for nvarchars in line with what it should be. If the std::u16string type works as well as I hope it does (will have to see what CI says about mac), there are some more spots where it could be used, for example to replace |
There was a problem hiding this comment.
Pull request overview
This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).
Changes:
- Add
simdutf(viafind_packageorFetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths. - Replace a number of
std::wstringusages withstd::u16stringfor connection strings, queries, and SQL_C_WCHAR parameter buffers. - Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
mssql_python/pybind/CMakeLists.txt |
Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
mssql_python/pybind/ddbc_bindings.h |
Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
mssql_python/pybind/ddbc_bindings.cpp |
Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
mssql_python/pybind/connection/connection.h |
Updates connection APIs/state to store connection strings as std::u16string. |
mssql_python/pybind/connection/connection.cpp |
Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
mssql_python/pybind/connection/connection_pool.h |
Updates pooling key types and APIs to use std::u16string. |
mssql_python/pybind/connection/connection_pool.cpp |
Implements pooling with std::u16string connection string keys. |
mssql_python/pybind/unix_utils.h |
Removes the SQLWCHARToWString declaration. |
mssql_python/pybind/unix_utils.cpp |
Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)
mssql_python/pybind/ddbc_bindings.cpp:583
- In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses
data()+SQL_NTSand setsbufferLengthtosize() * sizeof(SQLWCHAR). ForSQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (preferc_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.
Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.
std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
paramBuffers, param.cast<std::u16string>());
LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
"length=%zu characters, buffer=%zu bytes",
paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
dataPtr = sqlwcharBuffer->data();
bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
*strLenOrIndPtr = SQL_NTS;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| static_assert(sizeof(SQLWCHAR) == sizeof(char16_t), "SQLWCHAR must be 16-bit"); | ||
|
|
||
| if (length > 0) { | ||
| std::memcpy(utf16.data(), value, length * sizeof(SQLWCHAR)); |
📊 Code Coverage Report
Diff CoverageDiff: main...HEAD, staged and unstaged changes
Summary
mssql_python/pybind/ddbc_bindings.cppLines 1558-1566 1558 LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
1559 ErrorInfo errorInfo;
1560 if (retcode == SQL_INVALID_HANDLE) {
1561 LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1562 errorInfo.ddbcErrorMsg = "Invalid handle!";
1563 return errorInfo;
1564 }
1565 assert(handle != 0);
1566 SQLHANDLE rawHandle = handle->get();Lines 1633-1641 1633 return records;
1634 }
1635
1636 // Wrap SQLExecDirect
! 1637 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
1638 LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
1639 "query_length=%zu chars",
1640 (void*)StatementHandle->get(), Query.length());
1641 if (!SQLExecDirect_ptr) {Lines 1650-1658 1650 SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
1651 (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
1652 }
1653
! 1654 SQLWCHAR* queryPtr = const_cast<SQLWCHAR*>(reinterpretU16stringAsSqlWChar(Query));
1655 SQLRETURN ret = SQLExecDirect_ptr(StatementHandle->get(), queryPtr, SQL_NTS);
1656 if (!SQL_SUCCEEDED(ret)) {
1657 LOG("SQLExecDirect: Query execution failed - SQLRETURN=%d", ret);
1658 }mssql_python/pybind/ddbc_bindings.h📋 Files Needing Attention📉 Files with overall lowest coverage (click to expand)mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.8%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.error.h: 2.1%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.9%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16.h: 18.2%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 28.7%
mssql_python.pybind.build._deps.simdutf-src.src.simdutf.haswell.simd16-inl.h: 33.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.build._deps.simdutf-src.src.generic.utf16.utf8_length_from_utf16_bytemask.h: 61.5%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%🔗 Quick Links
|
Work Item / Issue Reference
Summary