LockTests

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.674 (1703/CreatorsUpdate/Redstone2)
Intel Core i5-3317U CPU 1.70GHz (Ivy Bridge), 1 CPU, 4 logical and 2 physical cores
Frequency=1656395 Hz, Resolution=603.7207 ns, Timer=TSC
[Host] : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2558.0
DefaultJob : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2558.0

Method | Files | Mean | Error | StdDev | Scaled | ScaledSD
SingleThreaded
A reference, fully single-threaded solution. Our goal is to take advantage of a multi-core CPU, i.e. to end up with a multithreaded solution that is faster than this one - several different attempts at that follow below.
SingleThreaded | 2000000 | 5.385 ms | 0.0796 ms | 0.0706 ms | 1.00 | 0.00
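For orientation, the task behind all the benchmarks can be sketched roughly as below. The `inputs` array, the 128-slot histogram of string lengths and all names are assumptions made for illustration; the actual code is in the benchmark sources.

```csharp
static class SingleThreadedSketch
{
    // Histogram of string lengths; 128 slots assumed, matching the 128 locks
    // used by the per-item variants described below.
    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        foreach (var s in inputs)
            counts[s.Length]++;   // only one thread touches counts => no synchronization needed
        return counts;
    }
}
```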
MultiThreadedGlobalLock
Locking a single shared global lock in every iteration. Note that the main locking overhead occurs when the lock cannot be taken - a thread has to wait for another thread to release the global lock. Also note that the result is not as bad as theoretically possible, because Monitor.Enter spins on a spinlock for a while - many waits therefore complete in this active fashion and never reach the passive wait (the waiting thread does not yield the CPU).
MultiThreadedGlobalLock | 2000000 | 107.572 ms | 0.2929 ms | 0.2740 ms | 19.98 | 0.25
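A minimal sketch of this variant, assuming the same hypothetical `Count` shape as above, with Parallel.ForEach and a single shared lock object:

```csharp
using System.Threading.Tasks;

static class GlobalLockSketch
{
    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        var globalLock = new object();

        Parallel.ForEach(inputs, s =>
        {
            lock (globalLock)        // every iteration contends for the same Monitor
            {
                counts[s.Length]++;  // tiny critical section compared to the locking overhead
            }
        });
        return counts;
    }
}
```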
MultiThreadedGlobalMutex_1_100
Same as above, but the shared lock is not a managed lock but an OS-kernel-implemented Mutex. There is not only the overhead of calling into the OS kernel (for that effect see the MultiThreadedOneMutexPerItem_1_100 comments below); because the Mutex does not include a spinlock, every time the lock cannot be taken the thread yields the CPU and goes into a passive wait ⇒ the huge locking overhead of this solution (the critical section is tiny compared to the time it takes the OS to reschedule the waiting thread).
IMPORTANT NOTE: This solution executes only 1/100 of the operations compared to the others (so that the benchmark ends in a reasonable time) ⇒ multiply all times of this solution by 100, i.e. the real time is 9387.3 ms ≈ 9.4 seconds, so failing to take the Mutex introduces overhead that makes this solution about 1743 times slower than the reference single-threaded one.
MultiThreadedGlobalMutex_1_100 | 2000000 | 93.873 ms | 1.7672 ms | 1.6530 ms | 17.43 | 0.37
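A sketch of the same loop guarded by one kernel Mutex (names are again illustrative, and the _1_100 detail that only every 100th operation is actually executed is omitted here):

```csharp
using System.Threading;
using System.Threading.Tasks;

static class GlobalMutexSketch
{
    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        using (var mutex = new Mutex())
        {
            Parallel.ForEach(inputs, s =>
            {
                mutex.WaitOne();                  // kernel transition; no spinning, blocked threads yield the CPU
                try { counts[s.Length]++; }
                finally { mutex.ReleaseMutex(); } // second kernel transition per critical section
            });
        }
        return counts;
    }
}
```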
MultiThreadedOneLockPerItem
Here we have one lock per item (one per every possible string length), thus 128 locks in total (bringing a memory overhead - still small, however, with respect to the total of 2000000 input strings). Note this solution is still slower than the reference single-threaded solution: while there is now only a small chance of 2 threads fighting over a single lock (i.e. a small chance of a thread failing to take a lock), taking the lock successfully still takes some time ⇒ because the critical section is very short (just a few instructions), the locking and unlocking overhead (including a memory barrier) takes a comparable amount of time as the execution of the critical section's body itself ⇒ this solution is still slower than the single-threaded one.
MultiThreadedOneLockPerItem | 2000000 | 72.216 ms | 0.3058 ms | 0.2860 ms | 13.41 | 0.18
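A sketch with one lock object per string length (hypothetical names, 128 lengths assumed as above):

```csharp
using System.Threading.Tasks;

static class LockPerItemSketch
{
    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        var locks = new object[128];     // one Monitor-based lock per possible string length
        for (int i = 0; i < locks.Length; i++)
            locks[i] = new object();

        Parallel.ForEach(inputs, s =>
        {
            lock (locks[s.Length])       // contention is now rare, but enter/exit still costs
            {
                counts[s.Length]++;
            }
        });
        return counts;
    }
}
```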
MultiThreadedOneMutexPerItem_1_100
Same as above, but all 128 locks are replaced by 128 OS-kernel-implemented Mutexes. Here, too, most threads take the lock successfully almost every time, so the overhead of yielding the CPU (see MultiThreadedGlobalMutex_1_100 above) is not present, but the overhead of a kernel-implemented synchronization primitive remains, making this solution slower than the one above (locking/unlocking the Mutex costs two syscalls to the kernel per critical section).
MultiThreadedOneMutexPerItem_1_100 | 2000000 | 7.443 ms | 0.0067 ms | 0.0059 ms | 1.38 | 0.02
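The same idea with kernel Mutexes instead of Monitor locks might look like this (illustrative sketch; the _1_100 subsampling is again omitted):

```csharp
using System.Threading;
using System.Threading.Tasks;

static class MutexPerItemSketch
{
    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        var mutexes = new Mutex[128];     // one kernel Mutex per possible string length
        for (int i = 0; i < mutexes.Length; i++)
            mutexes[i] = new Mutex();

        Parallel.ForEach(inputs, s =>
        {
            var m = mutexes[s.Length];
            m.WaitOne();                  // two syscalls (acquire + release) per critical section
            try { counts[s.Length]++; }
            finally { m.ReleaseMutex(); }
        });

        foreach (var m in mutexes)
            m.Close();
        return counts;
    }
}
```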
MultiThreadedOneLockPerItemRange
This solution splits the counts into several ranges, each sharing one lock ⇒ this brings no advantage here; just notice that, because there is now a higher probability of 2 threads fighting over a lock, this solution's speed falls between the MultiThreadedOneLockPerItem and MultiThreadedGlobalLock solutions.
MultiThreadedOneLockPerItemRange | 2000000 | 88.031 ms | 0.3325 ms | 0.3111 ms | 16.35 | 0.21
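A sketch of the range-based variant; the concrete split into 16 locks of 8 consecutive lengths each is an assumption, the real partitioning may differ:

```csharp
using System.Threading.Tasks;

static class LockPerRangeSketch
{
    private const int RangeSize = 8;     // hypothetical: 8 string lengths share one lock

    public static int[] Count(string[] inputs)
    {
        var counts = new int[128];
        var locks = new object[128 / RangeSize];
        for (int i = 0; i < locks.Length; i++)
            locks[i] = new object();

        Parallel.ForEach(inputs, s =>
        {
            lock (locks[s.Length / RangeSize])   // more lengths per lock => more contention
            {
                counts[s.Length]++;
            }
        });
        return counts;
    }
}
```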
MultiThreadedLocalCountsViaThreadLocal
A theoretically sound solution: each thread now has a private (isolated ⇒ no need for locking) array of all counts and generates its own view of the final result. After processing all input data, all the subresults from the worker threads are merged together in the calling thread. The problem is that we do not know in which thread the processing lambda gets executed, so it is hard to know which counts array is "ours" ⇒ the solution here is to use a ThreadLocal variable to hold the per-thread counts array. To be able to do the final merging, we enabled tracking of all per-thread values in the ThreadLocal instance ⇒ this, however, introduces a small overhead (together with accessing the ThreadLocal in every iteration of a very tight loop) that makes this solution also slower than the reference single-threaded one.
MultiThreadedLocalCountsViaThreadLocal | 2000000 | 33.228 ms | 0.5288 ms | 0.4688 ms | 6.17 | 0.11
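A sketch of the ThreadLocal-based variant; the trackAllValues flag is what allows the final merge over localCounts.Values (names and shape are illustrative):

```csharp
using System.Threading;
using System.Threading.Tasks;

static class ThreadLocalCountsSketch
{
    public static int[] Count(string[] inputs)
    {
        // trackAllValues: true lets us enumerate every thread's array afterwards,
        // but adds bookkeeping overhead to each .Value access.
        using (var localCounts = new ThreadLocal<int[]>(() => new int[128], trackAllValues: true))
        {
            Parallel.ForEach(inputs, s =>
            {
                localCounts.Value[s.Length]++;   // ThreadLocal access in every iteration of the tight loop
            });

            var counts = new int[128];
            foreach (var local in localCounts.Values)
                for (int i = 0; i < counts.Length; i++)
                    counts[i] += local[i];
            return counts;
        }
    }
}
```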
MultiThreadedPartitionCountsViaThreadLocalIds
Disables the per-thread tracking of values in the ThreadLocal instance by using ThreadLocal just to store 0-based thread IDs (used as indexes into a List of local counts). Introduces the overhead of the List's indexer (thus slower than the previous solution) + introduces a potential RACE CONDITION (see the benchmark source code for details).
MultiThreadedPartitionCountsViaThreadLocalIds | 2000000 | 35.531 ms | 0.6831 ms | 0.6390 ms | 6.68 | 0.15
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaLocking
Solves the race condition above by locking on the List's lock, which reintroduces a global shared lock ⇒ this becomes the slowest solution not using Mutexes.
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaLocking | 2000000 | 168.721 ms | 0.5039 ms | 0.4467 ms | 31.73 | 0.42
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaPreallocation
Another way to solve the race condition, which gets rid of the locking and of the List as well. It still has almost the same speed as the original ThreadLocal solution, however, as the overhead of accessing the ThreadLocal in a tight loop remains.
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaPreallocation | 2000000 | 32.632 ms | 0.5259 ms | 0.4919 ms | 6.14 | 0.12
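A sketch of the preallocation idea: per-thread 0-based IDs handed out via Interlocked, indexing into a preallocated array of histograms instead of a List (the MaxThreads bound and all names are assumptions made for illustration):

```csharp
using System.Threading;
using System.Threading.Tasks;

static class PreallocatedThreadLocalIdsSketch
{
    public static int[] Count(string[] inputs)
    {
        // Assumption: an upper bound on how many worker threads Parallel.ForEach will use.
        const int MaxThreads = 64;
        var perThreadCounts = new int[MaxThreads][];
        for (int i = 0; i < MaxThreads; i++)
            perThreadCounts[i] = new int[128];

        int lastId = -1;
        using (var myId = new ThreadLocal<int>(() => Interlocked.Increment(ref lastId)))
        {
            Parallel.ForEach(inputs, s =>
            {
                // ThreadLocal access in the tight loop is still the dominant overhead here.
                perThreadCounts[myId.Value][s.Length]++;
            });
        }

        var counts = new int[128];
        foreach (var local in perThreadCounts)
            for (int i = 0; i < counts.Length; i++)
                counts[i] += local[i];
        return counts;
    }
}
```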
MultiThreadedLocalCountsViaPartitions
This solution does not iterate over individual items; instead it splits the input data into several partitions (jobs), and the parallel for then iterates over these predefined jobs. Every job has its own (isolated) array of counts (the subresults for that specific partition), so no locking is necessary. Merging all the subresults is done in the calling thread - this is not a big problem, as "partition count" * "items in the counts array" (the subresults) is much smaller than the amount of input data (and the input data processing is done in multiple threads) ⇒ this solution achieves a non-trivial speedup, but on a 2-core machine it is definitely not twice as fast as the single-threaded solution (we should not expect the Scaled column of the table to be 0.5), as not everything is done in multiple threads. Still, this is a good compromise: the parallel work now executes without threads blocking each other, so this solution is finally (non-trivially) faster than the single-threaded one.
MultiThreadedLocalCountsViaPartitions | 2000000 | 3.595 ms | 0.0223 ms | 0.0186 ms | 0.67 (SPEEDUP) | 0.01
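A sketch of the partitioned variant: Parallel.For iterates over partition indices, each partition fills its own histogram, and the small partial results are merged sequentially in the calling thread (the partition count of 64 is an arbitrary illustrative choice):

```csharp
using System;
using System.Threading.Tasks;

static class PartitionedCountsSketch
{
    public static int[] Count(string[] inputs, int partitionCount = 64)
    {
        int partitionSize = (inputs.Length + partitionCount - 1) / partitionCount;
        var partialCounts = new int[partitionCount][];

        // Parallel.For iterates over whole partitions, not individual items.
        Parallel.For(0, partitionCount, p =>
        {
            var local = new int[128];               // isolated per-partition histogram => no locking
            int from = p * partitionSize;
            int to = Math.Min(from + partitionSize, inputs.Length);
            for (int i = from; i < to; i++)
                local[inputs[i].Length]++;
            partialCounts[p] = local;               // each partition writes its own slot => no race
        });

        // Merge the small partial results sequentially in the calling thread.
        var counts = new int[128];
        foreach (var local in partialCounts)
            for (int i = 0; i < counts.Length; i++)
                counts[i] += local[i];
        return counts;
    }
}
```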
AggressiveForLocalCountsViaPartitions
Same "best" solution as above, but now we are not using the Parallel.For infrastructure of sharing threads, but we are creating 32 of our own threads (note that Parallel.For is for this case using 6 pre-started threads). Result is slower (in fact now once again even slower than the single threaded solution), as the whole job is too short (only units of miliseconds) to mitigate the overhead of thread creation (see the results for bigger task below, where the job is big enough to "consume" the thread creation overhead).
AggressiveForLocalCountsViaPartitions | 2000000 | 6.964 ms | 0.0504 ms | 0.0447 ms | 1.29 | 0.02
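A sketch of the manual-thread variant: 32 threads, each processing one preassigned partition (names and the exact partitioning are illustrative):

```csharp
using System;
using System.Threading;

static class ManualThreadsSketch
{
    public static int[] Count(string[] inputs, int threadCount = 32)
    {
        var partialCounts = new int[threadCount][];
        var threads = new Thread[threadCount];
        int partitionSize = (inputs.Length + threadCount - 1) / threadCount;

        for (int t = 0; t < threadCount; t++)
        {
            int p = t;    // capture the loop variable for the closure
            threads[t] = new Thread(() =>
            {
                var local = new int[128];
                int from = p * partitionSize;
                int to = Math.Min(from + partitionSize, inputs.Length);
                for (int i = from; i < to; i++)
                    local[inputs[i].Length]++;
                partialCounts[p] = local;
            });
            threads[t].Start();     // thread creation + start is the overhead being measured
        }

        foreach (var thread in threads)
            thread.Join();

        var counts = new int[128];
        foreach (var local in partialCounts)
            for (int i = 0; i < counts.Length; i++)
                counts[i] += local[i];
        return counts;
    }
}
```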
VeryAggressiveForLocalCountsViaPartitions
Manual thread creation as above - now creating 512 threads - the total thread-creation overhead is now huge for a millisecond-sized job.
VeryAggressiveForLocalCountsViaPartitions | 2000000 | 80.325 ms | 0.2020 ms | 0.1461 ms | 14.92 | 0.19
SingleThreaded | 20000000 | 52.376 ms | 0.3259 ms | 0.3049 ms | 1.00 | 0.00
MultiThreadedGlobalLock | 20000000 | 1,064.712 ms | 0.9144 ms | 0.8553 ms | 20.33 | 0.12
MultiThreadedGlobalMutex_1_100 | 20000000 | 1,050.898 ms | 21.3432 ms | 62.9308 ms | 20.07 | 1.20
MultiThreadedOneLockPerItem | 20000000 | 700.022 ms | 5.2827 ms | 4.9414 ms | 13.37 | 0.12
MultiThreadedOneMutexPerItem_1_100 | 20000000 | 76.375 ms | 0.5554 ms | 0.5195 ms | 1.46 | 0.01
MultiThreadedOneLockPerItemRange | 20000000 | 901.060 ms | 21.9222 ms | 20.5060 ms | 17.20 | 0.39
MultiThreadedLocalCountsViaThreadLocal | 20000000 | 324.110 ms | 3.1844 ms | 2.9787 ms | 6.19 | 0.07
MultiThreadedLocalCountsViaPartitions | 20000000 | 34.689 ms | 0.1856 ms | 0.1736 ms | 0.66 | 0.00
AggressiveForLocalCountsViaPartitions
Creates 32 threads. As we can see here, if the job is big enough, the overhead of creating an excessive number of threads almost dissipates - with 10 times more items than the same solution above, this solution is finally faster than the single-threaded variant, and is almost as fast as reusing threads the way Parallel.For does (allocating them from a thread pool - more on the thread pool in later lectures). IMPORTANT NOTE: There is still a non-trivial memory overhead in having an excessive number of threads - e.g. at least one page of stack per thread, etc.
AggressiveForLocalCountsViaPartitions | 20000000 | 38.711 ms | 0.1340 ms | 0.1254 ms | 0.74 | 0.00
VeryAggressiveForLocalCountsViaPartitions
Creates 512 threads. The job is still too small for this number of threads - the relative overhead is now smaller than for the 10-times smaller task, but the total overhead of creating 512 threads still makes this slower than the single-threaded variant.
VeryAggressiveForLocalCountsViaPartitions | 20000000 | 98.237 ms | 0.7386 ms | 0.6909 ms | 1.88 | 0.02