Debugger late attaching

I often want to debug processes launched by another process. Somehow I usually find myself interested in the early part of the launched process. The problem is that the interesting bit has often passed before I manage to attach. In these scenarios it’s useful to temporarily add these few lines of code before the interesting bit.

#include <windows.h>  // IsDebuggerPresent, Sleep; __debugbreak is an MSVC intrinsic

while (!IsDebuggerPresent())  // spin until a debugger attaches
    Sleep(50);
__debugbreak();               // break right here once attached

// interesting bit

I often use this trick to launch processes from a CLI instead of via Visual Studio, just to avoid having to copy-paste and alter arguments in project properties. On that note, if someone knows of a good command-line / debug arguments plugin for Visual Studio, please leave a comment below! By the way, I use Martin Ridgers’ excellent clink to get bash-style ctrl+r searchable history on Windows.

Compile time sizeof / alignof

Manually adding up member sizes or figuring out alignment is for chumps! And compilers. Fortunately you can avoid being a chump by abusing your compiler into printing the result of sizeof() in a compilation error.

You: template<int> class X; X<sizeof(SomeType)> _;

Compiler: ‘_’ uses undefined class ‘X<48>’

You: I’m so sorry, I totally thought that would compile.
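For completeness, here’s the whole trick as a stand-alone snippet, extended to cover alignof too. SomeType and its layout are just an example; the numbers in the comments assume a typical x64 compiler.

template <int> class X;  // intentionally left undefined

struct SomeType { char c; double d; };

X<sizeof(SomeType)>  size_;   // error: 'size_' uses undefined class 'X<16>'
X<alignof(SomeType)> align_;  // error: 'align_' uses undefined class 'X<8>'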

Update

Visual Studio can show the result of sizeof-expressions by hovering over them (see reddit comments). The feature seems to rely on IntelliSense, which unfortunately is too slow for me to use. Additionally I often code in a stand-alone text editor and build from the command-line. Forcing a compilation error that tells you the size is IDE-independent and works on other compilers too.

Prefer SRW locks over Critical Sections

This post explains why a Slim Reader/Writer lock (SRWL) is often preferable over a Critical Section (CS) when writing Win32 applications.

Slim

SRWL is 8 bytes on x64 while CS is 40. CS requires setup and teardown via kernel calls while SRWL is zero-initialized via SRWLOCK_INIT assignment. Even if you expect little contention and performance doesn’t matter, SRWL generates smaller code and consumes less memory.
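For the record, the difference in ceremony looks something like this (a minimal sketch):

#include <windows.h>

SRWLOCK srw = SRWLOCK_INIT;          // ready to use; no teardown needed

void CriticalSectionCeremony() {
    CRITICAL_SECTION cs;
    InitializeCriticalSection(&cs);  // required setup before first use
    // ... use the lock ...
    DeleteCriticalSection(&cs);      // required teardown
}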

If you have 100,000s of objects with some internal lock, the reduced memory consumption itself may matter. The performance impact of avoiding cache misses is usually even more important. Since Intel’s Nehalem launched in late 2008, the cache line size of modern x64 processors has been 64 bytes. Spending 40 of those bytes on a lock seriously hurts data locality for smaller objects.

Fast

First off, the SRWL implementation or at least the underlying kernel code has changed over the last few years. Older benchmarks might be outdated.

Both CS and SRWL spin in user mode before falling back to a lightweight sleep mechanism, NtWaitForAlertByThreadId(). Only CS supports tweaking the spin time. I haven’t analyzed the implementations further than that.
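For reference, the CS spin count is set at initialization time; the 4000 below is just an arbitrary example:

#include <windows.h>

void SpinTweakExample() {
    CRITICAL_SECTION cs;
    InitializeCriticalSectionAndSpinCount(&cs, 4000);  // spin up to 4000 iterations before sleeping
    // SetCriticalSectionSpinCount(&cs, 8000);         // or adjust it later
    DeleteCriticalSection(&cs);
}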

Nor have I tried to create any artificial benchmark to compare speeds. Real-world parallel performance is too messy.

What I will share is an anecdote. I’ve tried switching between CS and SRWL in 20 or so contended scenarios. SRWL has always been faster or as fast, and it has often improved wall-time performance noticeably.

I won’t provide any numbers. The amount of work done while locked, lock granularity, parallelism level, contention level, read-to-write ratio, cache pressure, CPU and other factors have too much impact to make numbers interesting.

I’m not suggesting that SRWL is generically faster than CS. Profile your own workload and find out. Please consider sharing your findings in the comments.

Non-reentrant

This is a feature, not a problem.

Non-reentrant locks force clear public boundaries and make it easy to statically reason about lock acquisition order and deadlocks. Well, as long as you avoid stupid things like callbacks while holding a lock.

Reentrant locks are temporarily useful when parallelizing legacy code bases where you don’t want to refactor too much up front.

The original POSIX mutex was actually made reentrant by “accident”. I wonder how many threading bugs would’ve been avoided if reentrant mutexes hadn’t become mainstream…

A thread that write-acquires the same SRWL twice will “deadlock” itself. This makes it simple to detect and fix mistakes. Just look at the call stack. Thread timings do not introduce any nondeterminism.
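A minimal sketch of the self-inflicted “deadlock”:

#include <windows.h>

SRWLOCK lock = SRWLOCK_INIT;

void SelfDeadlock() {
    AcquireSRWLockExclusive(&lock);
    AcquireSRWLockExclusive(&lock);  // hangs this thread right here, deterministically;
                                     // the call stack points straight at the bug
}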

Recursive read-acquires used to cause “deadlocks” too, or at least I’m 90% sure of it 🙂 Unless I’m mistaken, Microsoft silently changed the behavior either in some update or between Win8 and Win10. Unfortunately this implementation detail makes reentrancy mistakes more difficult to spot. Mistakenly nested read lock scopes lead to nasty threading bugs when the innermost scope releases the lock too soon. Perhaps worse, the outer scope might release the lock taken by another reader. You can add a thread-local bool to dynamically validate your read lock scope usage and disable it via macros by default. Microsoft’s SAL annotations for locks might help catch these bugs at compile time, but I’ve never tried them myself.
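A minimal sketch of what I mean, assuming one flag per lock type rather than per lock instance; all names here are mine:

#include <windows.h>
#include <cassert>

#define VALIDATE_LOCK_SCOPES 1  // set to 0 to compile the checks out

struct ValidatedReadLock {
    SRWLOCK lock = SRWLOCK_INIT;
    static thread_local bool holdsRead;  // tracks this thread's read scope

    void AcquireShared() {
#if VALIDATE_LOCK_SCOPES
        assert(!holdsRead && "nested read-acquire on the same lock type");
        holdsRead = true;
#endif
        AcquireSRWLockShared(&lock);
    }

    void ReleaseShared() {
        ReleaseSRWLockShared(&lock);
#if VALIDATE_LOCK_SCOPES
        holdsRead = false;
#endif
    }
};

thread_local bool ValidatedReadLock::holdsRead = false;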

Reader parallelism

Parallel reads are quite common. SRWL lets any number of readers hold the lock at the same time, while CS prevents this parallelism.
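The shared/exclusive SRWL API in a nutshell:

#include <windows.h>

SRWLOCK lock = SRWLOCK_INIT;

void Reader() {
    AcquireSRWLockShared(&lock);     // any number of readers may hold the lock at once
    // ... read-only access ...
    ReleaseSRWLockShared(&lock);
}

void Writer() {
    AcquireSRWLockExclusive(&lock);  // writers get the lock exclusively
    // ... mutate shared state ...
    ReleaseSRWLockExclusive(&lock);
}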

Write-starvation

The downside of reader parallelism is write-starvation. SRWL promises neither write-preference nor any fairness. While CS doesn’t give any intrinsic fairness guarantees either, it doesn’t increase the write-starvation risk by supporting parallel reads.

The Windows thread scheduler provides some fairness via round-robin when waking up threads. This helps when a lock is held long enough that all blocked threads finish their user-mode spinning. Don’t rely on such implementation details though.

If writer progress is crucial, neither CS nor SRWL is suitable as the sole synchronization mechanism. Higher level constructs, like producer-consumer queues, might be preferable over locks in these cases.

Concurrency Runtime

concurrency::reader_writer_lock gives stronger priority guarantees than SRWL and is designed for cooperative threading. This comes at a price. In my experience it is significantly slower than both CS and SRWL. It also weighs in at 72 bytes.

Personally, I think it’s way too automagic to execute jobs while trying to acquire a lock, but I guess it might suit someone. AFAIK, you can’t even opt out of it.

I have no experience with Intel Threading Building Blocks, but I’m guessing the parallel STL may replace both libraries in the future.

False sharing

The risk of false sharing is much larger for SRWL than for CS: a CS’s 40 bytes eat up most of a 64-byte cache line, so add some object state to that and the risk of having two CSs on the same cache line drops significantly. Eight-byte SRWLs pack far more tightly.

When creating lock arrays or sharding up a hash set to remove contention, remember to align each shard if it’s smaller than your target cache line size.
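A minimal sketch of an aligned shard, assuming a 64-byte target cache line; the names and shard count are mine:

#include <windows.h>
#include <cstddef>

constexpr std::size_t kCacheLineSize = 64;  // assumed target; check your hardware

struct alignas(kCacheLineSize) Shard {
    SRWLOCK lock = SRWLOCK_INIT;
    // ... per-shard state ...
};

Shard shards[16];  // each shard starts on its own cache line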

Do not align up by default. That wastes memory and reduces cache utilization. False sharing is rare even when using SRWLs. It is only an issue when multiple threads rapidly modify and read a limited set of objects at the same time. Given thousands of small objects with one lock each, the occasional false sharing is usually preferable to bloating all objects. When uncertain, profile to find out what is best.

Kernel bug

I should mention a kernel bug that caused me to lose a bit of confidence in SRWL and Windows in general. A few years ago Frostbite coders started noticing weird bugs where threads failed to acquire random SRWLs. This happened primarily on dual-CPU machines but occasionally on single-CPU ones. Debugging showed that no other threads held the lock. More surprisingly, continuing execution or stepping the blocked thread forward made the acquire succeed. Attaching a debugger alerted all threads 🙂 After a long period of investigation, slowly getting repro times down from days to half an hour, I managed to pin it down and get it confirmed as a kernel bug that also affected IOCP and condition variables.

It took 8 months from when we first noticed the problem until this hotfix was released, and even longer before it rolled out via Windows Update. In the same year I found two Visual Studio 64-bit compiler bugs: this one, where if-statements compiled to the wrong jump instruction, and an ABI bug where a union with aligned members got the wrong size. 64-bit teething troubles I guess, but not very confidence inspiring…

Final notes

In my experience most locks protect some object from occasional concurrent access. Contention is not the normal case. Both CS and SRWL have good instruction locality when acquiring and releasing an available lock. Keeping the object small to get good data cache locality usually matters more for performance than raw acquire/release speed. It also increases the chance that the lock and the protected data share the same cache line. The smaller size is the primary reason I routinely choose SRWL over CS.

For contended locks you should always measure your optimizations. Know your target cache line size and be wary of false sharing.

The most important contention optimizations will rarely be in the lock itself. Holding the lock for as short a time as possible is what matters. Do all heavy lifting up front. Consider using a separate outer lock array to avoid doing heavy lifting twice, and use the inner lock for state protection only. Touch input data and incur cache misses before acquiring the lock. Avoid global heap allocations while holding the lock. Consider wrapping allocators with block or linear allocators. Reserve up front when applicable. And so on. I might do a separate post on removing contention in the future.
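As a trivial illustration of moving the heavy lifting out from under the lock (all names here are mine):

#include <windows.h>
#include <string>
#include <utility>
#include <vector>

SRWLOCK g_lock = SRWLOCK_INIT;
std::vector<std::string> g_entries;

std::string BuildExpensiveEntry(int input) {         // stand-in for real work
    return "entry " + std::to_string(input);
}

void PushBad(int input) {
    AcquireSRWLockExclusive(&g_lock);
    g_entries.push_back(BuildExpensiveEntry(input)); // heavy lifting under the lock
    ReleaseSRWLockExclusive(&g_lock);
}

void PushBetter(int input) {
    std::string entry = BuildExpensiveEntry(input);  // heavy lifting first...
    AcquireSRWLockExclusive(&g_lock);
    g_entries.push_back(std::move(entry));           // ...lock held only for the insert
    ReleaseSRWLockExclusive(&g_lock);
}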

Finally, non-reentrant locks make reading existing code a lot easier. Reentrant locking is the goto of concurrent flow control.

A bit about me

Hello world.

I’ve coded C++ actively since 2000 and been in the games industry for the last 7 years. I joined DICE in early 2009 to code AI for Battlefield Bad Company 2. I worked on AI, coop and CPU performance for BF3. For BF4 I led the Engine/Core team that handled memory, packaging, CI, builds, tools, rendering, physics, audio, nasty bugs and more. I moved to Frostbite’s core systems group in 2014, where I’ve worked on our pipeline framework and our structured data and reflection systems, and I’ve lately been implementing our new asset database / snapshot filesystem.

I’ve been more focused on desktop and console software than server-side code. For the last few years I’ve had the pleasure of not splitting my focus across multiple platforms, even though I do miss the PS3 tool chain now and then. Being able to optimize for SSDs and gigabit ethernet has been nicer than optimizing for optical media and slow HDDs. Unfortunately I’ve never gotten to do any GPGPU work.

I have a Haskell background from university and an interest in computer languages. I’ve followed Rust’s development since the early 0.1 alpha and look forward to a future where Rust can gradually replace C++. Since I’ve mainly worked in large codebases shared with 100s to 1000s of engineers, I’ve developed a preference for statically typed languages over dynamic ones, even though I kind of like LISP and Ruby.

Frostbite has a quarterly action week where we can work on our own projects, as long as they have some potential value for EA. I used some of my latest action week to write these first few blog posts and create this blog. I’d like to give my sincere thanks and appreciation to Frostbite and EA for this freedom.

The views expressed on this blog are entirely my own and not the views of EA, DICE or Frostbite.