Current Status of my SBCL Repository

Last updated: 2013-01-31 10:44:54 UTC, 2013-01-31 14:44:54 Moscow (Russia).
Issue Tracker at GitHub
SBCL/win32 and Threads: Initial Discussion
Version 1.1.4.0.mswin.1288-90ab477: What's New, ToDo, &c
Future of this project depends on you. Donate!
Local SBCL Downloads:
- 11 MiB ~ sbcl-1.1.4.0.mswin.1288-90ab477-x86.msi: MSI package for 32-bit Windows (ia32)
- 12 MiB ~ sbcl-1.1.4.0.mswin.1288-90ab477-x86-64.msi: MSI package for 64-bit Windows (AMD64)
- 3611 KiB ~ sbcl-1.1.4.0.mswin.1288-90ab477.tar.bz2: Source snapshot for this version
- Archived stand-alone executables, with most `contrib' modules preloaded. They may be useful when normal software installation is not allowed for your account, but running unverified executables is allowed.
  - ATTENTION: those executables include some useful REPL extensions, so there may be a reason to prefer them; however, source code implementing the extensions is not published yet. The latter may be a reason to avoid the executables, or to contact me and ask for source access.
  - 11 MiB ~ sbcl-ci-exe-1.1.4.0.mswin.1288-90ab477-x86.zip: Stand-alone executable for 32-bit Windows (ia32)
  - 11 MiB ~ sbcl-ci-exe-1.1.4.0.mswin.1288-90ab477-x86-64.zip: Stand-alone executable for 64-bit Windows (AMD64)
Archive directory: public builds since 1.0.50.0.mswinmt.865
Do you need some other executable? Feel free to contact me. The Issue Tracker provided by GitHub may be more convenient for this purpose.
Do you need support or assistance for Common Lisp development with SBCL and Windows? Don't hesitate to ask: I'd be glad to provide consulting and support services at a reasonable price.

Merge Status at the End of 2012

Features Merged before SBCL-1.1.3
Probably To Be Merged Later
Probably Never To Be Merged
For Happy Users of my Fork

I have to Apologize...

Current Changes

Thread-Local Data Storage

Foreign Thread Callbacks

Stdcall Callbacks

Memory-Mapped Core on Win32

Foreign Symbol References in Cross-Compiled LISP-OBJS

Splitting GENESIS-2: Compile-Lisp-OBJS, Collect-Refs, Relink-Runtime
The Alternative: Dynamic Resolving of Runtime Symbols
SB-DYNAMIC-CORE Meets LINKAGE-TABLE

Linkage-table on Win32

Surviving Data Execution Prevention

Special Handling of Console FDs in read() and write() equivalents

Backtrace with Foreign Function Names

Most Questionable Hack: Lisp/C FPU Context Switch

GC and GetWriteWatch

32 KiB BACKEND_PAGE_BYTES

Thread Space: Lazy Commit

FD-STREAM Buffers: 64 KiB on Win32

Win32-specific Futex Support

:UCS-2 External Format For Console I/O

GC Internals: Work in Progress

Merge Status at the End of 2012

Since spring 2012, I have been unable to work on SBCL (I managed to make occasional merges, but that's all). Hence no new code, no new ideas... but it turned out to be, in a sense, beneficial: while there was no new development, David Lichteblau started to integrate the entire thing into SBCL mainline!

It was a rather non-trivial piece of work, not at all mechanical. Many times he had to refactor my code, moving cross-platform things out of platform-specific files, throwing away garbage, cleaning up the rest. Now I am positively amazed at how much code was accepted. David was very careful not to throw away anything useful.

Remaining differences may be categorized the following way:

historical leftovers in my codebase vs. David's cleaner rewrites. Example: the SB-SAFEPOINT thing, which is Windows-only in my build and now cross-platform in mainstream.
features yet to be integrated, important enough to justify the existence of my fork for a while. Example: RUN-PROGRAM improvements (some of them could be rewritten better, some are trivially mergeable — like argument escaping).
features that are useful, but maybe not useful enough to integrate their tricky implementation as-is: e.g. I'm unsure whether shared memory-mapping of core files on Windows is worth the hassle (there's too much special-casing in the memory management code).
features that ended up with different implementations, reflecting the different opinions of David and me. Example: foreign thread callbacks are supported both here and there; my variant is Windows-only, David's is cross-platform. His version is better for some use cases, mine for some others.
a bunch of unimportant (mostly performance) hacks in my tree, that should be elsewhere. I'll either port them to current code or throw them away.

Features Merged before SBCL-1.1.3

Windows/AMD64 port
Threads, timers and interrupts on Windows
Safepoints (cross-platform rewrite)
STDCALL convention for callbacks
Long file name support
SB-BSD-SOCKETS:NON-BLOCKING-MODE
Control-C handler for Windows console
Unicode console I/O
Using file handles + kernel32 stuff instead of MSVCRT handles
Linkage table for former "static" symbols

Probably To Be Merged Later

RUN-PROGRAM rewrite
Memory-mapped core file

Probably Never To Be Merged

Windows'2000 support
- I'm unsure I want it enough to continue testing. I definitely don't need it now. Please let me know if you use SBCL on Win2k.
GetWriteWatch instead of userspace page fault handler
- Theoretically, it can be a notable performance gain, but I have no way to know: I build and test on Wine, not on real Windows.

For Happy Users of my Fork

As you can see, Windows support in upstream SBCL has made great progress. That's especially true of the upcoming SBCL-1.1.3 release. Don't forget to give it a try when it's available. Please let me know if any differences (upstream vs. fork) affect your use cases, for better or worse, especially if the differences aren't documented here yet.

If you test for some *FEATURES* of my fork, please get ready for some changes (I won't maintain a separate code base for such a minor reason):

SB-GC-SAFEPOINT becomes SB-SAFEPOINT
- As I now use SB-SAFEPOINT too, migration is trivial.
FDS-ARE-WINDOWS-HANDLES disappears
- Use SB-WIN32:GET-OSFHANDLE and SB-WIN32:OPEN-OSFHANDLE to abstract away the difference. They became no-ops when I/O migrated to kernel32.
ALIEN-CALLBACK-CONVENTIONS, ALIEN-CALLBACK-STDCALL, ALIEN-CALLBACK-CDECL are local for now, but maybe they will get merged. To work everywhere, check (FBOUNDP 'SB-ALIEN::ALIEN-FUN-TYPE-CONVENTION) instead (or use some other similar hack).
Did I forget something here? Let me know.

I have to Apologize...

Two months ago I intended to restore cross-platform SBCL buildability as soon as possible, to stash away non-critical patches (those not directly related to win32 threading), and to provide a version ready for integration. Not that I have forsaken those plans (I hope to come back to these tasks in January), but I've postponed them significantly.

The explanation is trivial: some ideas that occurred to me while I worked with SBCL code seemed so interesting that I felt a great urge to try them out.

There is one good thing about it, however: integration with SBCL upstream didn't become problematic during this time. Some code in my branch is now much cleaner than it used to be, so reintegration may be even easier now.

This text is written to describe present changes in the ``bleeding edge'' branch of my repository (tls63, which is now the default branch), relative to upstream SBCL and to Dmitry Kalyanov's code. For now I'm not yet trying to decide what I propose for integration, or to predict which patches will be useful in the long run.

Only the changes done on purpose are described here; bugs and other unintended things don't belong to this document.

Current Changes

Thread-Local Data Storage

The tls63 branch started with the idea of keeping thread-specific data not in the arbitrary data slot of the NT TIB, but in a `legally owned' place allocated by TlsAlloc().

Slot 63 is the highest TLS slot unconditionally available in the TIB (slots above it are allocated on demand when TlsSetValue is called). The offset of slot 63 in the TIB is fixed (of course, modulo the `unofficial but stable' status of the NT TIB layout itself). Machine code accessing slot 63 is as simple as code accessing the arbitrary data slot.

TLS slots are allocated from lower to higher ones. SBCL runtime has several DLL dependencies; DllMain() entry points of those DLLs may allocate some TLS slots before SBCL's main() starts.

Current SBCL runtime in tls63 branch relies on slot 63 being free when runtime's main() is called. It's almost guaranteed to be this way with our current DLL dependencies (among which kernel32 and ws2_32 have special areas in TIB for their use, so they don't use TLS slots at all). If slot 63 happens to be busy, SBCL runtime will refuse to start.

If slot 63 is free, SBCL runtime allocates it (there is no official way to allocate TLS slot with known index; hence SBCL allocates all free slots up to 63, then frees lower slots).
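The allocate-then-release trick can be modelled in a few lines. The following is a minimal Python sketch, not the runtime's C code: `FakeTls` is a hypothetical stand-in for TlsAlloc()/TlsFree(), and it assumes the allocator always hands out the lowest free index, as described above.

```python
TARGET_SLOT = 63  # highest TLS slot that always has storage in the TIB


class FakeTls:
    """Toy stand-in for TlsAlloc()/TlsFree(); hands out the lowest free index."""

    def __init__(self, busy=()):
        self.busy = set(busy)

    def alloc(self):
        index = 0
        while index in self.busy:
            index += 1
        self.busy.add(index)
        return index

    def free(self, index):
        self.busy.discard(index)


def claim_slot(tls, target=TARGET_SLOT):
    """Allocate slots until `target` is reached, then release the others.

    Returns True if `target` now belongs to us, False if it was already busy.
    """
    taken = []
    while True:
        index = tls.alloc()
        taken.append(index)
        if index >= target:
            break
    got_it = taken[-1] == target
    # Give back everything except the target slot itself.
    for index in taken:
        if not (got_it and index == target):
            tls.free(index)
    return got_it
```

In this model a busy slot 63 makes the allocator skip to a higher index, so the conflict is detected and everything is released; that corresponds to the case where the runtime refuses to start.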

Summary. Instead of the TIB arbitrary data slot, SBCL runtime in tls63 branch uses the resource with a known ownership protocol; while conflicts are theoretically possible, undetected conflicts aren't.

Foreign Thread Callbacks

What happens when an SBCL runtime function is called in a thread that was not created by SBCL (and isn't the initial thread "adopted" by it)?

Real (upstream) SBCL does something smart in the special case of signal handlers: the signal is automatically delivered to one of SBCL's native threads (it has nothing to do with win32: no signals there).

As for FFI callback functions, defined using SB-ALIEN::DEFINE-ALIEN-CALLBACK or SB-ALIEN::ALIEN-LAMBDA, calling them in foreign threads is guaranteed to fail in real SBCL.

The tls63 branch on win32 is different: it provides support for calling alien callbacks in foreign threads. Unfortunately, porting the solution to other platforms is non-trivial. This is only partially caused by my (mis)design decisions; there are some objective reasons why the problem should be solved differently on win32 and on other platforms.

Why bother? Well, win32 is special here — as usual. Even some trivial system programming tasks, like writing "services" (daemons), use foreign thread callbacks on this platform.

From the user's point of view, foreign thread callback support doesn't look complicated: when a foreign thread calls Lisp callback for the first time, it becomes known to SBCL; after that, it's visible in (SB-THREAD:LIST-ALL-THREADS) results (its automatically set THREAD-NAME reflects the fact that it's an autodetected foreign thread, and includes system thread identifier). JOIN-THREAD may wait for foreign thread exit, and THREAD-ALIVE-P may be used to test if a foreign thread already exited.

Internals of the foreign thread callback support are more complicated: the basis for the win32 implementation is MS Windows system support for userspace-scheduled thread-like things called fibers. I've modified the pthreads_win32 module so that it supports fibers as well as threads, assigning a distinct pthread_self() identity to both kinds of objects. When a foreign thread callback is detected, two things happen:

Fully initialized Lisp thread is created; unique feature of this Lisp thread is that its OS-THREAD represents a fiber.
The fiber mentioned above is scheduled in a foreign thread, and a callback is run in it.

Further discussion of foreign thread callback implementation internals is outside the scope of this document.

Stdcall Callbacks

MS Windows interfaces use callbacks frequently, to the extent of looking bizarre to an experienced Unix programmer. That's why foreign thread callbacks, described in the section above, are important.

There is another sad story about callbacks, totally unrelated to threads. Most WinAPI functions on X86 use a calling convention named stdcall (arguments are passed on the stack, and the called function removes them before returning). The usual default convention of X86 C compilers is called cdecl (arguments are passed on the stack, too, but the called function doesn't remove them).

SBCL foreign function interface for X86 is designed to call external functions without requiring convention specification. SBCL will just do the right thing when the foreign function is either stdcall or cdecl. The method used to achieve it is simple and elegant: SBCL grants to all foreign functions an unquestionable right to clobber %ESP register. Code generated by SBCL for foreign C function calls normally saves %ESP before the call and restores it afterwards.

Unfortunately, there is still a problem with callback calling convention, and it cannot be solved without some way to specify a convention (it's impossible to determine if our caller expects us to clean up the stack).
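The asymmetry between call-outs and callbacks comes down to stack pointer arithmetic. Here is a toy Python model of it, assuming 4-byte stack slots; the function names are illustrative and nothing here is SBCL code.

```python
WORD = 4  # bytes per stack slot on 32-bit x86


def esp_after_return(esp, nargs, convention):
    """Stack pointer as seen by the caller right after the callee returns."""
    esp -= nargs * WORD      # caller pushes the arguments
    esp -= WORD              # CALL pushes the return address
    esp += WORD              # RET pops the return address
    if convention == "stdcall":
        esp += nargs * WORD  # `RET n`: the callee also removes the arguments
    elif convention != "cdecl":
        raise ValueError(f"unknown convention: {convention}")
    return esp


def call_out(esp, nargs, convention):
    """Call-out in the style described above: save ESP, call, restore ESP."""
    saved = esp                                     # saved before the call
    esp = esp_after_return(esp, nargs, convention)  # value left by the callee
    esp = saved                                     # restored unconditionally
    return esp
```

`call_out` gives the same result for both conventions, which is why call-outs need no convention specification. A callback is in the position of `esp_after_return`: it has to pick one branch, and if it picks the one its caller does not expect, the caller's stack pointer ends up off by `nargs * WORD` bytes.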

In SBCL upstream, all alien callbacks are cdecl. Consequently, on Windows we have a problem: not only are WinAPI functions defined to be stdcall, but almost all callbacks that they invoke are required to be stdcall as well.

I decided to provide a way of specifying stdcall convention for alien callbacks.

It's interesting that the same problem was solved once by Alastair Bridgewater. He added calling convention to the callback specification syntax; when two callbacks differ only in calling convention, they had exactly the same alien function type.

I decided to go a bit further: in tls63 code, the alien function type contains a calling convention specification. The NIL convention is supposed to mean ``universal'' for callouts and ``cdecl'' for callbacks.

:STDCALL and :CDECL keywords may be used to specify the callback convention explicitly; currently, they affect X86 code generation, but are silently ignored on other platforms.

The greatest problem I faced when working on this thing was mostly an aesthetic one: how to add a convention spec to the alien function type syntax without breaking compatibility? So far, my chosen syntax looks like this:

(ALIEN-LAMBDA INT ((x INT)) (1+ x)) ;; traditional spec, NIL convention
(ALIEN-LAMBDA (:STDCALL INT) ((x INT)) (1+ x)) ;; the same function, but :STDCALL
(ALIEN-LAMBDA (:CDECL INT) ((x INT)) (1+ x)) ;; the same function, but :CDECL
(CAST ... (FUNCTION (:STDCALL INT) INT)) ;; standalone function type spec.

As you see above, the calling convention is specified by replacing the result type with a list of a convention keyword and the result type. This syntax is deceptive, in a sense: it looks as if the convention belonged to the result type, but it doesn't. There is no such alien type as (:STDCALL INT).

At least, my chosen syntax is not ambiguous (it would become ambiguous if (:STDCALL INT) were parsable into an alien type spec; but defining alien type parsers is outside the public SBCL interface).

One thing that may change in the future: alien function types with the same result type and the same argument types, but different conventions, are now disjoint.

Memory-Mapped Core on Win32

Every day when I was reading SBCL sources, I pondered upon the os_map() comment in win32-os.c: we copy core file data instead of mapping because "Windows semantics completely screws this up". As it turned out after some experiments, the word completely is too harsh; I've managed to implement copy-on-write mapping of core files, and it was not the hardest of all the things mentioned in this document (foreign thread callbacks, for example, were certainly harder).

MS Windows provides CreateFileMapping and MapViewOfFile functions; together, they resemble our familiar mmap(). Here is the list of problems that had to be solved for core file mapping:

Alignment requirements for file and memory offsets in MapViewOfFile are stricter than required by mmap(): os_vm_page_size alignment is insufficient. File mapping on Windows uses the same unit as MEM_RESERVE VirtualAlloc, called ``allocation granularity'' (GetSystemInfo call retrieves its value, together with other interesting things).
- Solution: I modified SBCL's core dumper; it now aligns memory space snapshots at a 64*page boundary, which is stricter than any realistic GetSystemInfo granularity (I observe the latter's usual value to be (ash 1 16) = 64 KiB, or 16 pages).
For non-anonymous file mapping, the same virtual memory region can't be reserved/committed by VirtualAlloc and used for fixed-address mapping simultaneously.
- Solution: virtual memory reserved for the mmapped space is first unreserved, then its appropriate part is mmapped, then the remainder is reserved again.
For the same memory-protection semantics (as seen by SBCL), different VirtualProtect flags are required for file-backed virtual memory and for virtual memory allocated by VirtualAlloc.
- Solution (obvious, and obviously non-unique): just remember what ranges of virtual memory are mapped, and adjust VirtualProtect arguments accordingly.
File-backed virtual memory can't be decommitted or committed back.
- Solution: the only case where it really matters is when GC zeroes free pages by invalidating/revalidating them. Now I resort to fast_bzero for mmapped regions, retaining VirtualFree/VirtualAlloc for other space.
Old MS Windows versions either reject ``executable'' permission requested for file mapping (and mapped memory) or don't know about it, while modern versions may really need it if we want to execute code located in the mapped memory.
- Solution: try executable mapping first, fall back to non-executable mapping second; remember which method succeeded; adjust VirtualProtect arguments (when modifying mapped memory protection) according to this information.
Cross-compiler core dumper still generates core with misaligned space offsets.
- Solution: I continue to support non-mapped core; when mapping is impossible, os_map falls back to copying, as it used to do unconditionally. Copying method works fine with misaligned cold-sbcl.core; it may be helpful as well if a file system driver refuses to mmap (the only danger related to the latter class of situations is a seemingly successful MapViewOfFile that fails on memory access. By now I've never observed such a thing when the file is readable without errors; and when an I/O error occurs, copying os_map would fail as well).
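The alignment rule and the copy fallback from the list above can be sketched as follows. This is a Python model under stated assumptions: a 4096-byte page, the 64 KiB granularity mentioned above, and hypothetical helper names (`align_up`, `can_map`, `load_strategy`) that do not exist in the runtime.

```python
PAGE_SIZE = 4096                 # assumed os_vm_page_size
GRANULARITY = 1 << 16            # usual allocation granularity (64 KiB)
CORE_ALIGNMENT = 64 * PAGE_SIZE  # the dumper's stricter "64*page" alignment


def align_up(offset, alignment):
    """Round `offset` up to the next multiple of `alignment` (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (offset + alignment - 1) & ~(alignment - 1)


def can_map(file_offset, address, granularity=GRANULARITY):
    """Both the file offset and the address must be multiples of the granularity."""
    return file_offset % granularity == 0 and address % granularity == 0


def load_strategy(file_offset, address, granularity=GRANULARITY):
    """Map when the offsets allow it; otherwise fall back to copying."""
    return "map" if can_map(file_offset, address, granularity) else "copy"
```

A space dumped at an offset rounded with `align_up(..., CORE_ALIGNMENT)` always passes the check, because the 64-page boundary is a multiple of the granularity; a page-aligned but otherwise arbitrary offset, like those in a cross-compiled cold core, generally takes the copy path.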

I decided to simplify the implementation by restricting memory-mapping to dynamic space; other spaces are always copied. Their size is unnoticeable when compared to dynamic space size, so why bother?

User-visible results of this change:

SBCL process with a huge core image now starts faster than it used to;
The process working set size doesn't include all of the core, only the pages that were accessed;
Pagefile space is not consumed for unchanged memory-mapped pages (copy-on-write policy).
Multiple instances of SBCL with the same memory-mapped core share some of their working sets.
But it all comes for a small fee: memory-mapped core file is considered an open file by the OS, hence you can't delete or replace it while it's used by running SBCL instance. That's why I'm going to eventually provide a runtime option for disabling/enabling core file mapping on a case-by-case basis.

Foreign Symbol References in Cross-Compiled LISP-OBJS

When I worked on some IO-related stuff, requiring me to modify src/code/win32.lisp frequently, one thing was annoying me, build after build: when I added foreign function reference to Lisp source, and this function wasn't used in runtime C code, SBCL failed to build.

The current method of resolving foreign references used in SBCL upstream requires any foreign symbol mentioned in cross-compiled Lisp files to be referenced by the SBCL runtime too. Details are different for different platforms: Unix has ldso-stubs.S (and the tool to regenerate it), Win32 has the win32-os.c:scratch() function.

After some days of that scratch()-induced torture, I decided to reshape SBCL's build process in a way that would remove any manual maintenance requirement for foreign symbol references. Of course, the first place where I looked for some guidance was the aforementioned tool for generating ldso-stubs.S. It's hard to describe my disappointment when I found the same hand-written foreign function lists in tools-for-build/ldso-stubs.lisp.

Splitting GENESIS-2: Compile-Lisp-OBJS, Collect-Refs, Relink-Runtime

My first method of avoiding win32-os.c:scratch() maintenance was rather simple: do genesis-2 in two passes. The first pass (called genesis-1a in my code) cross-compiles Lisp sources into obj/from-xc; then it cold-loads resulting objects, with fixup-resolving FOP redefined to collect :foreign fixups.

After the first pass, foreign symbol names (with FOREIGN_SYMBOL_REFERENCE() around) are put into src/runtime/gen1a-undefs. The runtime executable is linked after this file is ready; one of its source files (win32-stubs.S) includes gen1a-undefs, after defining the FOREIGN_SYMBOL_REFERENCE preprocessor macro, so it expands into some stuff like ``.long ForeignFunctionName'' (assembler is preferred to C here because the latter requires some heavy GCCisms to refer to a name like CloseHandle@4).

Now runtime is ready; its symbol table is in sbcl.nm. The second half of genesis-2 (retaining the name genesis-2 in my branch) reads sbcl.nm symbol table, then cold-loads every lisp-obj again. This time, thanks to gen1a-undefs, every foreign fixup is resolved, and genesis-2 generates output/cold-sbcl.core. From this point on, build process continues in an unmodified way, like in upstream SBCL.
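To make the data flow concrete, here is a small Python model of the first pass's output and of the macro expansion. The one-invocation-per-line layout of gen1a-undefs and both helper names are assumptions made for illustration; only the macro name and the ``.long'' expansion come from the description above.

```python
def render_undefs(foreign_fixups):
    """Turn collected :foreign fixup names into one macro invocation per symbol."""
    seen = []
    for name in foreign_fixups:
        if name not in seen:  # each symbol needs to be referenced only once
            seen.append(name)
    return "".join(f"FOREIGN_SYMBOL_REFERENCE({name})\n" for name in seen)


def expand(undefs_text, template):
    """Mimic the preprocessor: substitute each invocation's argument into `template`."""
    prefix, suffix = "FOREIGN_SYMBOL_REFERENCE(", ")"
    lines = []
    for line in undefs_text.splitlines():
        assert line.startswith(prefix) and line.endswith(suffix)
        name = line[len(prefix):-len(suffix)]
        lines.append(template.format(name=name))
    return lines
```

Expanding the same text with the template `".long {name}"` models the win32-stubs.S case, while `"LDSO_STUBIFY({name})"` models redefining the macro for a Unix build; the list itself is generated once and never edited by hand.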

As for Unix platforms, reusing the genesis-1a output described above is as easy as redefining FOREIGN_SYMBOL_REFERENCE to LDSO_STUBIFY. The only doubt I have about this approach is this:

What if SBCL maintainers consider hand-written function lists a good thing?

The Alternative: Dynamic Resolving of Runtime Symbols

The method described above continues to work. The second method described below may provide some important goodies to someone who is developing SBCL runtime, continuously changing and recompiling it (that's why I prefer the second method for my own builds).

As long as foreign symbols in Lisp-OBJS are resolved using the information from src/runtime/sbcl.nm, modified runtime rebuild always requires core regeneration. Imported static symbol addresses cannot remain the same in the new core except by coincidence. What if we use some indirection method for linking against runtime — something like linkage-table used for shared objects? It turned out to be possible; win32 builds from tls63 branch with :sb-dynamic-core enabled work exactly this way.

With :sb-dynamic-core, genesis-2 doesn't use any information about real runtime symbols when cold-sbcl.core is created. Each foreign fixup in Lisp-OBJS is ``resolved'' to an address from the special memory area. Names of foreign symbols ``resolved'' this way are collected into a list; this list ends up in cold-sbcl.core as a SYMBOL-VALUE of SB-VM::*REQUIRED-RUNTIME-C-SYMBOLS*.

For :sb-dynamic-core builds, SB-VM::*REQUIRED-RUNTIME-C-SYMBOLS* is a newly-introduced static symbol; its (constant) address is available to runtime among other GENESIS data.

Here is the most unusual part of the story: real name resolving is done by the runtime itself. Upon startup, the runtime iterates over (SYMBOL-VALUE SB-VM::*REQUIRED-RUNTIME-C-SYMBOLS*); each name in this list is resolved, and the result is stored (in a manner resembling the linkage-table implementation) in the special memory area mentioned above. The order of symbols in the list is the same as the order of address allocation in the special memory area; that's why each symbol address is stored in the place where the core expects to find exactly this symbol.

How is a foreign symbol name resolved by the runtime? Well, it's a banality: runtime calls GetProcAddress(). Oh, some missing details:

When runtime is linked, the linker is instructed to create an export directory in the executable, and to place all defined public symbols into it (-Wl,-export-all-symbols for MinGW's gcc).
Runtime gets its own HINSTANCE with GetModuleHandle(NULL) and uses it as an argument for GetProcAddress.
Runtime finds its own dynamic import directory; from that directory, it retrieves a HINSTANCE value for each of its directly-imported DLLs. If GetProcAddress returns NULL with the runtime's own HINSTANCE, the runtime retries resolving with each of those additional HINSTANCEs. Static references to kernel32, MSVCRT or Winsock2 symbols are successfully resolved at this stage.

The detail omitted earlier for the sake of simplicity: exactly as for the real linkage-table, we have to distinguish function references from variable references; the latter kind also requires some special handling in the client code (i.e. in the core itself). These two requirements are satisfied by modified FOREIGN-SYMBOL-SAP VOP: its xc-host version (with :sb-dynamic-core enabled) now generates :foreign-dataref fixups, that are normally disallowed in cross-compiled Lisp code. SB-VM::*REQUIRED-RUNTIME-C-SYMBOLS* is not just a list of names; for each name, GENESIS-2 records an additional value to distinguish data and function references.
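The two halves of this scheme, index assignment at genesis time and resolution at startup, can be modelled like this. It is a Python sketch rather than runtime code: plain dictionaries stand in for module export tables, `dict.get` stands in for GetProcAddress(), and the function names are hypothetical.

```python
def genesis_assign_slots(fixups):
    """Cold-load side: give every distinct foreign symbol the next table index."""
    required = []   # ordered list that would be stored in the core
    index_of = {}
    for name, is_data in fixups:
        if name not in index_of:
            index_of[name] = len(required)
            required.append((name, is_data))
    return required, index_of


def runtime_fill_table(required, modules):
    """Startup side: resolve each name in list order, trying each module in turn."""
    table = []
    for name, is_data in required:
        address = None
        for exports in modules:          # runtime's own exports first, then its DLLs
            address = exports.get(name)  # stands in for GetProcAddress()
            if address is not None:
                break
        table.append((address, is_data)) # None marks an undefined symbol
    return table
```

Because both sides walk the same list in the same order, entry `i` of the filled table is exactly the address that the core's code for symbol `i` expects, whichever module it was found in.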

Summary. When sb-dynamic-core is enabled, recompiled runtime continues to work with earlier-generated cores; however, two requirements for core and runtime compatibility are still in place:

GENESIS data (everything in src/runtime/genesis) must not have changed since the core was generated.
Foreign symbols used in the core should be exported by the runtime. When symbol reference is generated by EXTERN-ALIEN (or something with EXTERN-ALIEN inside, like DEFINE-ALIEN-ROUTINE), you have a chance to get a diagnostic message mentioning undefined alien; when missing symbol represents some fundamental runtime support function, like call_into_c, most probable outcome is a crash.

SB-DYNAMIC-CORE Meets LINKAGE-TABLE

The previous section is obsolete in one respect. There is no ``special VM area'' for dynamic linking against the runtime now: all runtime symbols are registered in the common linkage-table.

With :SB-DYNAMIC-CORE enabled, therefore, there are no ``static symbols'' that require special cases here and there; all foreign symbols are linked the same way. The difference is restricted to the initialization of linkage-table entries: the Lisp code managing foreign libraries can't run before the runtime is available, so the runtime has to ``preseed'' the linkage-table by itself.

Linkage-table on Win32

In SBCL upstream, linkage-table support for win32 is very close to being complete. A couple of places related to linkage-table are unnecessarily conditionalized on #!-win32: e.g. (update-linkage-table) just works on win32, if only permitted to try.

Long before my experimentation with runtime linking methods described in the previous section, I ensured that #!(and win32 linkage-table) support is on par with other platforms. Error handling for undefined alien variable references is among the things I had to add, but its reimplementation for win32 turned out to be a trivial task.

Some code related to linkage-table, e.g. (LOAD-SHARED-OBJECT), is problematic for win32, but the problems aren't win32-specific: overlooking dlopen/dlclose refcounting issues prevents library unloads, but the nature and presence of this problem doesn't change at all when LoadLibrary() and FreeLibrary() become the real functions behind dlopen/dlclose.

When I started to build the runtime executable with an export table, I had to remove #!-win32 conditionalizations of *RUNTIME-DLHANDLE* operations. *RUNTIME-DLHANDLE* has a very close win32 equivalent, namely, the result of GetModuleHandle(NULL). In current tls63, initialization and usage of *RUNTIME-DLHANDLE* is done on win32 in the same way as on a typical Unix platform (well, except for a no-op instead of dlclose'ing it: GetModuleHandle doesn't affect the refcount, while dlopen(NULL) does).

Surviving Data Execution Prevention

Data Execution Prevention (DEP) is an optional feature of modern MS Windows OSes, intended to protect running programs against some common methods of malicious code injection. When DEP is available, system administrator may disable it, enable it for some system services, or enable it for any executable (minus list of exceptions). Of course, any program that fails when DEP is enabled may be added to aforementioned list of exceptions; however, as a rule, application developers don't express too much enthusiasm when an executable suddenly requires system administrator intervention to start working. That's why successfully working with DEP enabled would be a good thing for SBCL.

The SBCL port for Windows has only one problem with DEP (solved in tls63). Though ``Data Execution'' is exactly the thing SBCL is doing all the time (and also the thing Common Lisp is all about), DEP doesn't complain, because all those data are explicitly marked executable. However, when SBCL registers an assembly routine called UWP-SEH-HANDLER as a SEH handler, DEP presence causes SBCL to die when the handler is about to be invoked. And it is invoked on any unwinding through UNWIND-PROTECT, which is quite a common thing to happen.

Probably, once upon a time, DEP authors either anticipated or experienced some technique of DEP circumvention involving unexpected SEH handler installation. While the main functionality of DEP is easy to understand (for CPUs with NX bit: set NX bit for non-executable pages — end of story), additional layers of ``DEP circumvention prevention'' may bring some surprises, as we can see. Here is the bottom line: DEP normally requires SEH handler function to be in an executable section of an EXE or a DLL.

After some hours of struggling with RtlUnwind, I finally found some information about the aforementioned restriction. After that I added a similar SEH handler implementation to src/runtime/x86-assem.S, and made (SET-UNWIND-PROTECT) install the new handler. DEP accepts its new location as ``safe''; the tls63 build of SBCL doesn't die anymore.

Special Handling of Console FDs in read() and write() equivalents

MSVCRT provides _read() and _write() functions, that are used in SBCL upstream as a drop-in replacement for Unix system calls (or libc wrappers around them). When Dmitry Kalyanov added win32 threading support to his branch of SBCL (that is the basis of my own one), he discovered that MS Windows synchronous I/O has a significant restriction: outstanding _read() blocks attempted _write() (or, to be precise, outstanding ReadFile() blocks attempted WriteFile(): this restriction is not related to MSVCRT ``lowio'' layer; its ultimate cause is the NT kernel, and NtReadFile(), NtWriteFile() are affected as well).

Multiprocessing Lisp code frequently uses separate threads for reading and writing the same stream, and never expects write operation to block until read operation completes. SLIME in multithreaded mode contains such code; SLIME is also the most obvious candidate for initial testing of threading support.

Dmitry made an initial reimplementation of _read() and _write() to prevent mutual blocking of reader and writer threads. That's how win32_unix_read() and win32_unix_write() first appeared in win32-os.c.

My acquaintance with SBCL runtime started with those two functions; my first changes were to make them use OVERLAPPED I/O mode when it's possible for a given FD (i.e. for an underlying file handle).

Later I worked on those functions to make them interruptible, such that SB-THREAD:INTERRUPT-THREAD makes the I/O call in the target thread return EINTR. In a typical case the flow of control then leaves the WITHOUT-INTERRUPTS section in the guts of REFILL-INPUT-BUFFER, and pending thread interruptions have a chance to run (unless interrupts are disabled around REFILL-INPUT-BUFFER as well).

Until recently, win32_unix_read() and win32_unix_write() were interruptible only when a subject file supported OVERLAPPED I/O. One recent step further: console device reading.

On MS Windows (as of the NT family, which is the only realistic target of the SBCL win32 port), console device handles differ from all other file or device handles in many ways. Normal handles refer to NT kernel entities (think "NTDLL" and "system calls"); console handles are internal to KERNEL32.DLL (think "userspace", despite the misleading library name). KERNEL32.DLL tries to create an illusion that console handles aren't different:

CreateFile emulates the presence of ``special devices'' like "CONIN$" and "CONOUT$";
DuplicateHandle and CreateProcess support console handle duplication and inheritance;
ReadFile and WriteFile calls are translated to ReadConsoleA and WriteConsoleA.

Console handles are now handled specially by win32_unix_read and win32_unix_write:

- Instead of using ReadFile or WriteFile, we translate read/write I/O to ReadConsoleW and WriteConsoleW. As we use the Unicode functions here, the Lisp-side fd-stream initialization code doesn't have to deal with ``console codepage'' settings: the external format for console streams is always :UCS-2.
- Unlike any operation on NT handles, ReadConsoleW is interrupted immediately if the handle is closed by another thread. This fact is now used to interrupt win32_unix_read (see wake_thread()) during console input.
- CR is removed from console input bytes, providing SBCL-compatible line endings (LF only) to the Lisp layer. The probability of further changes here is rather high (maybe we should check for line-input mode using GetConsoleMode(), etc.).
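
The CR removal in the last item can be sketched in portable C. This is a minimal illustration, not the code from win32-os.c: the function name strip_cr_utf16 and its in-place interface are assumptions made for the example. It treats the input as UTF-16 code units, as ReadConsoleW would deliver them, and drops every CR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not the actual win32-os.c code): remove every CR
 * (U+000D) from a buffer of UTF-16 code units, in place.
 * Returns the number of code units kept. */
size_t strip_cr_utf16(uint16_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    size_t out = 0;
    for (size_t in = 0; in < count; in++) {
        if (buf[in] != 0x000D)        /* keep everything except CR */
            buf[out++] = buf[in];
    }
    return out;
}
```

Given the code units for "hi" followed by CR LF, the function keeps three units and leaves LF as the only line terminator.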

Backtrace with Foreign Function Names

A Lisp backtrace inevitably includes some foreign functions, at least when C-STACK-IS-CONTROL-STACK. A typical backtrace item corresponding to such a function sets a new standard of informativeness and beauty:

("foreign function: #x40CDD0")

Some evil people among SBCL developers have already started an effort to turn this hacker's delight into some boring thing, like a function name (search SBCL sources for SAP-FOREIGN-SYMBOL to learn further details). Interestingly, I can't imagine a single case where (upstream) SAP-FOREIGN-SYMBOL successfully finds the name of the C function that called our Lisp code: it tries to find an address range enclosing the argument in the linkage table space, and if it ever succeeded with a return address from the frame pointer chain, it would mean that the CALL instruction is contained in the linkage table. Well, maybe I've overlooked something here.

I added another chunk of code to SAP-FOREIGN-SYMBOL (currently conditionalized on #!+win32). Among the foreign symbols exported by the runtime, it finds the highest address that is still lower than the argument address. The foreign symbol corresponding to that address is taken as the foreign function name, and the difference between the argument and the address found is taken as an offset within that function. Name and offset are formatted into a string.

When our (ancestor) caller happens to be a C function from the SBCL runtime, (backtrace) result items now look like this (last two items):

22: (SB-THREAD::INITIAL-THREAD-FUNCTION)
23: ("foreign function: call_into_lisp +#x6C")
24: ("foreign function: funcall0 +#x2D")

My code in (SAP-FOREIGN-SYMBOL) may be improved in many ways. Well, at least it finds something, and its result is correct in the common case when the caller is, indeed, an exported function of the SBCL runtime.
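
The lookup described above (highest symbol address below the argument, then name plus offset) can be sketched in C. This is only an illustration of the search: the real code is Lisp inside SAP-FOREIGN-SYMBOL, and the struct foreign_sym record, both function names, and the addresses in the usage note are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative symbol record; not an SBCL data structure. */
struct foreign_sym {
    const char *name;
    uintptr_t   address;
};

/* Return the symbol with the highest address that is still lower than pc,
 * or NULL when no symbol lies below pc. The table need not be sorted. */
const struct foreign_sym *
nearest_symbol(const struct foreign_sym *syms, size_t count, uintptr_t pc)
{
    const struct foreign_sym *best = NULL;
    for (size_t i = 0; i < count; i++) {
        if (syms[i].address < pc &&
            (best == NULL || syms[i].address > best->address))
            best = &syms[i];
    }
    return best;
}

/* Format "name +#xOFFSET" (or a bare "#xADDRESS" when nothing matches)
 * into out. Returns the snprintf() result. */
int describe_address(const struct foreign_sym *syms, size_t count,
                     uintptr_t pc, char *out, size_t outsize)
{
    assert(out != NULL && outsize > 0);
    const struct foreign_sym *s = nearest_symbol(syms, count, pc);
    if (s == NULL)
        return snprintf(out, outsize, "#x%llX", (unsigned long long)pc);
    return snprintf(out, outsize, "%s +#x%llX", s->name,
                    (unsigned long long)(pc - s->address));
}
```

With a table that places call_into_lisp at the made-up address #x401000, a return address of #x40106C is described as "call_into_lisp +#x6C". The sketch shares the weakness admitted above: an address inside a non-exported function is attributed to whichever exported symbol happens to precede it.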

Most Questionable Hack: Lisp/C FPU Context Switch

C calling conventions on X86 require the FPU stack to be empty on call entry and to hold 0 or 1 values on call exit (the latter when the function result is returned on the FPU stack). SBCL code, when it works with the X86 FPU, requires the FPU stack to be full (no empty registers as described by the tag word).

SBCL uses the FPU rather frequently during its normal work, even when the running task has no explicit floating-point calculations; hash table internals use FP operations, for example. On the other hand, the vast majority of runtime C code, as well as a lot of available libraries, doesn't use the FPU at all (i.e. it is unaffected by the FPU stack and doesn't modify it).

I had a bad feeling about the FLDZs and FSTPs executed on each foreign call to refill and empty the FPU stack. Assuming that Windows handles context switches like other modern OSes on X86, there are two possible scenarios: either it sees SBCL as eligible for lazy FPU context loading (thus setting the TS bit in CR0 while SBCL is running), or it doesn't (thus making SBCL the `owner' of the FPU instantly). While this goes unnoticed when no other tasks are actively using the FPU, it could cost a couple of context switches per C call entry/exit when there are such tasks (even SBCL's own threads).

The right thing to do is, of course, to fix the SBCL compiler so that it doesn't require a full FPU stack. Unfortunately, back then I was more familiar with SEH and the X86 FPU than with SBCL compiler internals; that's why I added some exception-handling code to do an equivalent of lazy FPU switching in userspace.

The idea is that when the :INVALID trap is enabled, we catch both a Lisp attempt to pop from the empty FPU stack and a C attempt to push onto the full one. The trickiest part of the solution is recovering after the FPU exception, as it can't be restarted by simply returning ExceptionContinueExecution: the failed FPU opcode has to be retrieved and reinterpreted/restarted separately.

As long as no one disables the :INVALID trap, the solution works as intended, allowing Lisp code to run with an empty FPU stack (and C code to run with a full FPU stack) until the CPU hits an FP instruction.

This part of code is to be dropped without regret as soon as the compiler is fixed.

GC and GetWriteWatch

Windows 2000 SP3 and above provide an interesting feature: if we reserve memory with the MEM_WRITE_WATCH flag, all written-to pages in the reserved region are tracked. The GetWriteWatch() function may be used to retrieve a list of dirty pages. This API is documented as intended for GCs, and is actually used, when available, e.g. in Boehm GC and in .NET.

The tls63 branch now detects the presence of GetWriteWatch and uses it for the same purpose, instead of write-protecting pages of the dynamic space and then trapping into SEH in order to unprotect the written page. GC requests for write protection are translated into ResetWriteWatch(), clearing the list of written-to addresses; at the beginning of collect_garbage(), all written-to pages that are `write protected' this way (per GC page table) are marked as unprotected.

As far as I remember, write tracking is not just sugar over the same wp-fault-unwp sequence (done in the kernel), but a CPU-level feature that modifies the page table on first write without causing any exception. There are several possible strategies for using this API; the one I've selected is not necessarily the best.

Unfortunately, write tracking is unavailable for memory regions populated with MapViewOfFile(); therefore it's used for all dynamic-space pages above the memory-mapped core, while the traditional SEH (and real write-protection) is used inside the memory-mapped core space. There are some alternatives to consider here as well.
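
The bookkeeping done at the beginning of collect_garbage() can be sketched as follows. This is a portable illustration, not the tls63 code: struct gc_page, unprotect_dirty_pages, and the GC_PAGE_BYTES value are assumptions, and the dirty[] array stands in for the list of written-to addresses that GetWriteWatch() would return.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GC_PAGE_BYTES 32768u    /* assumed GC page size */

/* Illustrative GC page table entry; not the SBCL layout. */
struct gc_page {
    unsigned char write_protected;   /* logical flag, no real protection */
};

/* For every reported dirty address, clear the logical write-protection
 * flag of the GC page containing it. Several dirty system pages may fall
 * into the same GC page. Returns the number of GC pages whose flag changed. */
size_t unprotect_dirty_pages(struct gc_page *table, size_t npages,
                             uintptr_t heap_base,
                             const uintptr_t *dirty, size_t ndirty)
{
    assert(table != NULL || npages == 0);
    size_t changed = 0;
    for (size_t i = 0; i < ndirty; i++) {
        if (dirty[i] < heap_base)
            continue;                          /* below the tracked region */
        size_t index = (size_t)((dirty[i] - heap_base) / GC_PAGE_BYTES);
        if (index < npages && table[index].write_protected) {
            table[index].write_protected = 0;
            changed++;
        }
    }
    return changed;
}
```

In a real runtime the flags would be set again, and the write-watch state reset with ResetWriteWatch(), whenever the GC asks for write protection.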

32 KiB BACKEND_PAGE_BYTES

Following SBCL upstream changes for other OSes, I switched to a 32 KiB page size as the default for Win32, fixing the code that assumed os_vm_page_size to be the system page size. If code simplicity were a priority, I would make it 64 KiB, the allocation unit. However, I considered debugging and testing the case of all-three-unequal BACKEND_PAGE_BYTES, dwPageSize, and dwAllocationGranularity a good thing in itself.

Thus allocation granularity places no restriction on BACKEND_PAGE_BYTES: everything from 4 KiB up is still available.
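
To make the three-unequal-sizes case concrete, here is a small sketch of rounding one request to each unit. The constants are typical x86 Windows values used as assumptions (real code would read dwPageSize and dwAllocationGranularity from GetSystemInfo()), and align_up is a helper name invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. */
enum {
    SYSTEM_PAGE_BYTES      = 4096,    /* dwPageSize */
    BACKEND_PAGE_BYTES     = 32768,   /* GC page size discussed above */
    ALLOCATION_GRANULARITY = 65536    /* dwAllocationGranularity */
};

/* Round n up to the next multiple of unit; unit must be a power of two. */
size_t align_up(size_t n, size_t unit)
{
    assert(unit != 0 && (unit & (unit - 1)) == 0);
    return (n + unit - 1) & ~(unit - 1);
}
```

A 20000-byte request rounds to 20480, 32768, or 65536 bytes depending on which unit is applied, so code that silently treats one of these sizes as another computes wrong boundaries.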

Thread Space: Lazy Commit

The tls63 branch allocates the chunk around struct thread (i.e. dynamic values, alien space, binding stack...) using MEM_RESERVE, precommitting only a couple of pages here and there when they're known to be used. All the rest is committed by the exception handler on demand.

As for the control stack, which can't be part of struct thread on win32, this branch:

- doesn't reserve 2 MiB in struct thread for the control stack, as we can't move it anyway (and, considering DEP, discussed in a section above, we definitely can't move it);
- lets CreateThread() use the default executable settings for the thread stack, reserving 2 MiB but committing one page.

There is a problem with the lazy-commit approach to the alien stack: some system functions verify user buffers and reject uncommitted memory without running our exception handler. A dual workaround is provided: (1) the VOP used for allocation always `touches' the topmost byte allocated, and (2) for a page fault in the alien stack, the exception handler commits all intermediate pages up to the bottom as well.
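
The second half of the workaround boils down to computing which range to commit. The sketch below shows that arithmetic only, under a loudly stated assumption: the alien stack grows downward, so its `bottom' is the highest address. The struct commit_range type and the function name are hypothetical; a real handler would pass the result to VirtualAlloc() with MEM_COMMIT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative result type: a page-aligned range to commit. */
struct commit_range {
    uintptr_t start;
    size_t    bytes;
};

/* Assumption: the stack grows downward, so `bottom` is its highest address
 * (exclusive). Returns the range from the page containing the faulting
 * address up to the bottom, covering all intermediate pages. */
struct commit_range
alien_commit_range(uintptr_t fault, uintptr_t bottom, size_t page_bytes)
{
    assert(page_bytes != 0 && (page_bytes & (page_bytes - 1)) == 0);
    assert(fault < bottom);
    uintptr_t start = fault & ~(uintptr_t)(page_bytes - 1);
    struct commit_range range = { start, (size_t)(bottom - start) };
    return range;
}
```

Committing the whole span means a later system call that receives a buffer anywhere between the faulting page and the bottom never sees uncommitted memory.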

FD-STREAM Buffers: 64 KiB on Win32

File buffer allocation code (the same in tls63 and upstream SBCL) ends up in os_validate, and thus has 64 KiB allocation granularity. Until it's modified to use malloc() instead, smaller FD stream buffers on Win32 waste space (15/16 of it for the original default of 4096 bytes).

For now, tls63 uses 64 KiB as a default buffer size. Switching to malloc() is equally easy, of course (and probably should be done anyway).

Win32-specific Futex Support

When I thought about SBCL's WAITQUEUE and MUTEX, implemented with futex_wait() and futex_wake(), implemented with pthread mutexes and pthread_cond_timedwait(), implemented with (...) some Win32 API, there seemed to be way too many layers. Implementing Win32 futexes resulted in better performance and (more importantly) better support for interrupts: there is no `uninterruptible quantum' in the new futex implementation, and pthread_kill(), now being part of the same module, is able to do its best.

:UCS-2 External Format For Console I/O

Done exactly as planned and stated on the initial announcement page.

GC Internals: Work in Progress

The current implementation of safepoints, interrupts, and GC signalling in tls63 has departed significantly from Dmitry Kalyanov's original code. It should be considered an experiment, with the possibility that the result will eventually be thrown away and rolled back to the original.

However, there are some logical changes, discovered during this experiment, that we should apply to the original code in this case:

- It's too late to enter the unsafe region (announce the thread as becoming a mutator) in call_into_lisp(). Calls into Lisp from the runtime are normally done using StaticSymbolFunction(); and, though the symbol is static, the function may be moved by GC at the moment of calling StaticSymbolFunction(), before call_into_lisp() is entered. Good markers for the real points where the unsafe region should begin are the existing calls to fake_foreign_function_call(); blocking GC signals (in the Unix world) is a logical equivalent of becoming GC-unsafe (for the Win32 safepoint-based build). Calls like alloc_sap() and other pa_alloc() wrappers can't be made from a supposedly safe region either.
- The problem described above was (in most cases) concealed by having the gc_safepoint() call wait until the current GC ends. However, as long as nothing is done to prevent another GC from starting between gc_safepoint() and call_into_lisp(), it may happen.
- For proper memory visibility there should be either a memory barrier when entering/leaving the safe region or some kind of `injected memory barrier' during gc_stop_the_world(). E.g. SuspendThread(), GetThreadContext() and ResumeThread() will cut it, not when GCing (it's too late) but when stopping the world. (I avoid SuspendThread() et al. altogether in my code; it's a debatable design decision caused by portability issues, e.g. my desire to support Wine. But then I have to use memory barriers in foreign call entry/exit.)
- Handling SEH stack unwinds as possible points of entering the unsafe region is, again, too late. There is a single scenario where it could be useful, and neither my current code nor the original code handles it right: unwinding from a foreign call to an outer foreign call with a Lisp alien callback (containing some unwind code) on the way. For all other possible scenarios, a stack unwind, it seems, shouldn't affect `gc safety status' at all (that's why I don't bind my `gc safety indicators' any more, by the way).