BenchmarkDotNet=v0.10.14, OS=Windows 10.0.15063.674 (1703/CreatorsUpdate/Redstone2)
Intel Core i5-3317U CPU 1.70GHz (Ivy Bridge), 1 CPU, 4 logical and 2 physical cores
Frequency=1656395 Hz, Resolution=603.7207 ns, Timer=TSC
[Host] : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2558.0
DefaultJob : .NET Framework 4.7.1 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2558.0
Method | Files | Mean | Error | StdDev | Scaled | ScaledSD |
---|---|---|---|---|---|---|
SingleThreaded A reference, fully single-threaded solution. Our goal is to take advantage of a multicore CPU, so we aim for a multithreaded solution that is faster than this one - several different attempts at that follow below. | 2000000 | 5.385 ms | 0.0796 ms | 0.0706 ms | 1.00 | 0.00 |
MultiThreadedGlobalLock Takes a single shared global lock in every iteration. Note that the main overhead of locking occurs when the lock cannot be taken immediately - a thread has to wait for another thread to release the global lock. The result is not as bad as theoretically possible, because Monitor.Enter first spins on a spinlock for a while - thus a lot of waits complete in this active fashion and never reach the passive wait (in which the waiting thread yields the CPU). | 2000000 | 107.572 ms | 0.2929 ms | 0.2740 ms | 19.98 | 0.25 |
MultiThreadedGlobalMutex_1_100 Same as above, but the shared lock is not a managed lock, but an OS-kernel-implemented Mutex. There is not only the overhead of calling into the OS kernel (for that effect see the MultiThreadedOneMutexPerItem_1_100 comments below), but because the Mutex does not include a spinlock, every time the lock cannot be taken the thread yields the CPU and goes into a passive wait ⇒ the huge locking overhead of this solution (the critical section is tiny compared to the time it takes the OS to reschedule the waiting thread). IMPORTANT NOTE: This solution executes only 1/100 of the operations compared to the others (so that the benchmark ends in a reasonable time) ⇒ multiply all times of this solution by 100, i.e. the real time is 9387.3 ms = 9.4 seconds, so failing to take the Mutex immediately introduces overhead that makes this solution 1743 times slower than the reference single-threaded one. | 2000000 | 93.873 ms | 1.7672 ms | 1.6530 ms | 17.43 | 0.37 |
MultiThreadedOneLockPerItem Here we have one lock per item (one per possible string length), thus 128 locks in total (a memory overhead - still small relative to the 2000000 input strings). Note this solution is still slower than the reference single-threaded solution: while there is now only a small chance of 2 threads contending for the same lock (i.e. a small chance of a thread failing to take a lock), even taking a lock successfully costs time ⇒ since the critical section is very short (just a few instructions), the locking and unlocking overhead (including a memory barrier) takes a comparable amount of time to executing the critical section's body ⇒ this solution is still slower than the single-threaded one. | 2000000 | 72.216 ms | 0.3058 ms | 0.2860 ms | 13.41 | 0.18 |
MultiThreadedOneMutexPerItem_1_100 Same as above, but all 128 locks are replaced by 128 OS-kernel-implemented Mutexes. Here too most threads almost always take the lock immediately, so the CPU-yielding overhead (see MultiThreadedGlobalMutex_1_100 above) is not present, but the overhead of a kernel-implemented synchronization primitive remains, making this solution slower than the one above (here we pay for locking/unlocking the Mutex, including two syscalls into the kernel per critical section). | 2000000 | 7.443 ms | 0.0067 ms | 0.0059 ms | 1.38 | 0.02 |
MultiThreadedOneLockPerItemRange This solution splits the counts into several ranges, each range sharing one lock ⇒ this brings no advantage here; just notice that since there is now a higher probability of 2 threads contending for a lock, this solution's speed falls between the MultiThreadedOneLockPerItem and MultiThreadedGlobalLock solutions. | 2000000 | 88.031 ms | 0.3325 ms | 0.3111 ms | 16.35 | 0.21 |
MultiThreadedLocalCountsViaThreadLocal A theoretically sound solution: each thread now has a private (isolated ⇒ no locking needed) array of all counts and builds its own view of the final result. After all input data has been processed, the subresults from the worker threads are merged in the calling thread. The problem is that we do not know in which thread the processing lambda executes, so it is hard to know which counts array is "ours" ⇒ the solution here is a ThreadLocal variable holding the per-thread counts array. To enable the final merge we turned on tracking of all per-thread values in the ThreadLocal instance ⇒ this, however, introduces a small overhead (together with accessing the ThreadLocal in every iteration of a very tight loop) that makes this solution, too, slower than the reference single-threaded one. | 2000000 | 33.228 ms | 0.5288 ms | 0.4688 ms | 6.17 | 0.11 |
MultiThreadedPartitionCountsViaThreadLocalIds Disables the per-thread tracking of values in the ThreadLocal instance by using the ThreadLocal only to store 0-based thread IDs (used as indexes into a List of local counts). Introduces the overhead of the List's indexer (thus slower than the previous solution) + introduces a potential RACE CONDITION (see the benchmark source code for details). | 2000000 | 35.531 ms | 0.6831 ms | 0.6390 ms | 6.68 | 0.15 |
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaLocking Solves the race condition above by taking the List's lock, which reintroduces a global shared lock ⇒ it becomes the slowest of the Mutex-free solutions. | 2000000 | 168.721 ms | 0.5039 ms | 0.4467 ms | 31.73 | 0.42 |
MultiThreadedPartitionCountsViaThreadLocalIdsCorrectedViaPreallocation Another way to solve the race condition, which gets rid of both the locking and the List. It still runs at almost the same speed as the original ThreadLocal solution, though, because the overhead of accessing the ThreadLocal from a tight loop remains. | 2000000 | 32.632 ms | 0.5259 ms | 0.4919 ms | 6.14 | 0.12 |
MultiThreadedLocalCountsViaPartitions A solution that does not iterate over individual items, but splits the input data into several partitions (jobs); the parallel for then iterates over these predefined jobs. Every job has its own (isolated) array of counts (the subresults for that specific partition), so no locking is necessary. Merging of all the subresults is done in the calling thread - this is not a big problem, as "partition count" * "items in the counts array" (the subresults) is much smaller than the amount of input data (and the input data processing is done in multiple threads) ⇒ this solution achieves a non-trivial speedup, though on a 2-core machine it is definitely not twice as fast as the single-threaded solution (we should not expect the Scaled column of the table to reach 0.5), as not everything runs in multiple threads. This is, however, a good compromise: the parallel work now executes without threads blocking each other, so this solution is finally (non-trivially) faster than the single-threaded one. | 2000000 | 3.595 ms | 0.0223 ms | 0.0186 ms | 0.67 (SPEEDUP) | 0.01 |
AggressiveForLocalCountsViaPartitions The same "best" solution as above, but instead of using the thread-sharing infrastructure of Parallel.For we create 32 threads of our own (note that Parallel.For uses 6 pre-started threads for this case). The result is slower (in fact once again even slower than the single-threaded solution), because the whole job is too short (only units of milliseconds) to amortize the overhead of thread creation (see the results for the bigger task below, where the job is large enough to "consume" the thread-creation overhead). | 2000000 | 6.964 ms | 0.0504 ms | 0.0447 ms | 1.29 | 0.02 |
VeryAggressiveForLocalCountsViaPartitions Manual thread creation as above - now creating 512 threads - the total thread-creation overhead is now huge for a millisecond job. | 2000000 | 80.325 ms | 0.2020 ms | 0.1461 ms | 14.92 | 0.19 |
SingleThreaded | 20000000 | 52.376 ms | 0.3259 ms | 0.3049 ms | 1.00 | 0.00 |
MultiThreadedGlobalLock | 20000000 | 1,064.712 ms | 0.9144 ms | 0.8553 ms | 20.33 | 0.12 |
MultiThreadedGlobalMutex_1_100 | 20000000 | 1,050.898 ms | 21.3432 ms | 62.9308 ms | 20.07 | 1.20 |
MultiThreadedOneLockPerItem | 20000000 | 700.022 ms | 5.2827 ms | 4.9414 ms | 13.37 | 0.12 |
MultiThreadedOneMutexPerItem_1_100 | 20000000 | 76.375 ms | 0.5554 ms | 0.5195 ms | 1.46 | 0.01 |
MultiThreadedOneLockPerItemRange | 20000000 | 901.060 ms | 21.9222 ms | 20.5060 ms | 17.20 | 0.39 |
MultiThreadedLocalCountsViaThreadLocal | 20000000 | 324.110 ms | 3.1844 ms | 2.9787 ms | 6.19 | 0.07 |
MultiThreadedLocalCountsViaPartitions | 20000000 | 34.689 ms | 0.1856 ms | 0.1736 ms | 0.66 | 0.00 |
AggressiveForLocalCountsViaPartitions Creates 32 threads. As we can see here, if the job is big enough, the overhead of creating an excessive number of threads almost disappears - with 10 times more items than the same solution above, this solution is finally faster than the single-threaded variant, and almost as fast as reusing threads the way Parallel.For does (allocating them from a thread pool - more on thread pools in later lectures). IMPORTANT NOTE: There is still a non-trivial memory overhead in having an excessive number of threads - e.g. at least one page of stack per thread, etc. | 20000000 | 38.711 ms | 0.1340 ms | 0.1254 ms | 0.74 | 0.00 |
VeryAggressiveForLocalCountsViaPartitions Creates 512 threads. The job is still too small for this number of threads - the relative overhead is now smaller than for the 10-times-smaller task, but the total overhead of creating 512 threads still makes it slower than the single-threaded variant. | 20000000 | 98.237 ms | 0.7386 ms | 0.6909 ms | 1.88 | 0.02 |
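The winning MultiThreadedLocalCountsViaPartitions idea (one private counts array per partition, merged sequentially at the end) is not tied to .NET. Since the benchmark source is C#, here is only a minimal illustrative Java sketch of the same pattern; the class and method names (`PartitionCounts`, `countLengths`) are ours, not from the benchmark:

```java
import java.util.Random;

public class PartitionCounts {
    // Count string lengths (0..127) with one private counts array per partition,
    // so workers never share mutable state and no locking is needed.
    static int[] countLengths(String[] data, int partitions) throws InterruptedException {
        int[][] local = new int[partitions][128];   // isolated subresults
        Thread[] workers = new Thread[partitions];
        int chunk = (data.length + partitions - 1) / partitions;
        for (int p = 0; p < partitions; p++) {
            final int from = p * chunk;
            final int to = Math.min(from + chunk, data.length);
            final int[] mine = local[p];
            workers[p] = new Thread(() -> {
                for (int i = from; i < to; i++)
                    mine[data[i].length()]++;       // no contention: 'mine' is private
            });
            workers[p].start();
        }
        for (Thread t : workers) t.join();          // join = safe publication of subresults
        int[] total = new int[128];                 // cheap sequential merge in the caller
        for (int[] part : local)
            for (int i = 0; i < 128; i++) total[i] += part[i];
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        String[] data = new String[100_000];
        Random rnd = new Random(42);
        for (int i = 0; i < data.length; i++)
            data[i] = "x".repeat(rnd.nextInt(128));
        int[] counts = countLengths(data, 4);
        int sum = 0;
        for (int c : counts) sum += c;
        System.out.println(sum);                    // every string counted exactly once
    }
}
```

The merge stays sequential because `partitions * 128` subresult cells are negligible next to the input size, mirroring the argument made for the benchmarked solution.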
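The MultiThreadedOneLockPerItem observation - that even an uncontended lock costs about as much as a few-instruction critical section - can likewise be sketched outside C#. Below is a hypothetical Java version using one monitor per counter (`synchronized` plays the role of C#'s `lock`/Monitor here; names are ours):

```java
public class StripedCounts {
    static int[] countLengths(String[] data, int threads) throws InterruptedException {
        int[] counts = new int[128];
        Object[] locks = new Object[128];           // one lock object per counter
        for (int i = 0; i < 128; i++) locks[i] = new Object();
        Thread[] workers = new Thread[threads];
        int chunk = (data.length + threads - 1) / threads;
        for (int t = 0; t < threads; t++) {
            final int from = t * chunk;
            final int to = Math.min(from + chunk, data.length);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    int len = data[i].length();
                    synchronized (locks[len]) {     // contention is rare with 128
                        counts[len]++;              // locks, but acquire/release
                    }                               // still dwarfs this one increment
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) w.join();
        return counts;
    }
}
```

With 128 stripes two threads rarely pick the same lock, yet every iteration still pays the acquire/release (and memory-barrier) cost - the same reason the benchmarked per-item-lock solution stays slower than the single-threaded baseline.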