~~CLOSETOC~~

~~TOC 1-3 wide~~

## Embarrassingly Parallel Processes

Julia offers three ways to run embarrassingly parallel computations on multiple CPU cores simultaneously: threads, `@distributed`, and `pmap`; a minimal sketch of all three follows.
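
The sketch below is only an illustration, not the benchmark program (that is in the appendix). It assumes a hypothetical stand-in task `work(i)` and that Julia was started with worker processes (`-p`) and/or with threads enabled via the `JULIA_NUM_THREADS` environment variable.

```julia
using Distributed

# Hypothetical stand-in task; @everywhere also defines it on all worker processes.
@everywhere work(i) = sum(sin, 1.0:1000.0)

n = 100
results = zeros(n)

# 1. Threads: shared memory, lowest overhead for memory-light tasks.
Threads.@threads for i in 1:n
    results[i] = work(i)
end

# 2. @distributed: runs on worker processes; @sync waits until all are done.
@sync @distributed for i in 1:n
    work(i)   # results are not collected unless a reducer is given
end

# 3. pmap: also uses worker processes and returns the results directly.
results = pmap(work, 1:n)
```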

* The performance benchmarks are shown in detail in the [[#appendix|appendix]]. They show that the differences in results can be fragile. Performance depends not only on how many function calls there are and how long each call takes (ideally for parallelism, about as many function calls as there are CPU cores, each long-running, to keep parallel overhead low), but also on whether the memory for the tasks can remain in the CPU cache. In general, on one computer, if the function is memory-light (think less than 32KB, including all data access) and/or the function calls are relatively long-lived, then **threads** typically have the best performance. Otherwise, `@distributed` tends to be better.

* Typically, you want one more process than the number of CPU threads doing work. This choice is secondary to getting the above recommendations right.

* The accounting is somewhat painful. To use 2 worker processes with `@distributed`, you need to add them explicitly, because the first process is the master. Having `nprocs() == 2` gives you only one real worker process, plus the master hanging around; see the snippet below.
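
For example, starting plain Julia (no `-p` flag) and adding workers by hand shows the accounting:

```julia
using Distributed

nprocs()      # 1: only the master process exists
addprocs(2)   # add 2 worker processes
nprocs()      # 3: master + 2 workers
nworkers()    # 2: only the added processes do @distributed / pmap work
```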


## Performance

The Xeon used for these benchmarks has (counting in Float64 elements of 8 bytes each) roughly a 2^12-element L1 cache, a 2^15-element L2 cache, and a 2^17.x-element L3 cache. To compare different cache scenarios, we consider four memory footprints designed to fit into these relative constraints: small: 2^11 elements = 16KB; medium: 2^14 = 128KB; large: 2^17 = 1MB; vlarge: 2^19 = 4MB. The calls can be frequent ("many short" invocations, i.e. many context switches), medium, or rare ("few long").

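A quick check of the byte figures quoted above (sizes are counted in Float64 elements of 8 bytes each):

```julia
# Bytes used by each scenario's Float64 vector (element counts from the text above).
for (name, n) in ((:small, 2^11), (:medium, 2^14), (:large, 2^17), (:vlarge, 2^19))
    println(name, ": ", n, " Float64s = ", 8n ÷ 1024, " KB")
end
# small: 2048 Float64s = 16 KB
# medium: 16384 Float64s = 128 KB
# large: 131072 Float64s = 1024 KB (1 MB)
# vlarge: 524288 Float64s = 4096 KB (4 MB)
```
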
^ **memory**   ^  **calls**   ^  **Solo**  ^  **Threads 2**  ^  **Distrib 2**  ^  **Pmap w/2**  ^  **Threads 16**  ^  **Distrib 16**  ^  **Pmap w/16**  ^  **note**  ^
|    |    |    |  **2 Work Cores (-p3)**  |||  **16 Work Cores**  |||    |
| small 16k     | many short   |  1.0 |  0.80 | 0.50 | -2.20 |  0.17 |  0.28 | -1.30 |  use threads    |
| medium 128k   | many short   |  1.0 |  0.83 | 0.50 |  0.94 | -3.79 |  0.36 |  0.53 |                 |
| large 1MB     | many short   |  1.0 | -2.53 | 0.76 |  1.55 | -2.47 |  0.29 |  0.75 |  avoid threads  |
| vlarge 4MB    | many short   |  1.0 | -3.84 | 0.62 |  0.76 | -2.29 |  0.47 |  0.45 |  avoid threads  |
| small 16k     | some medium  |  1.0 |  0.90 | 0.50 |  0.60 |  0.10 |  0.10 |  0.15 |                 |
| medium 128k   | some medium  |  1.0 |  0.59 | 0.50 |  0.59 |  0.14 |  0.14 |  0.14 |                 |
| large 1MB     | some medium  |  1.0 |  0.74 | 0.52 |  0.59 |  0.19 |  0.11 |  0.11 |                 |
| vlarge 4MB    | some medium  |  1.0 | -2.57 | 0.51 |  0.57 |  0.57 |  0.40 |  0.20 |  avoid threads  |
| small 16k     | few long     |  1.0 |  0.50 | 0.50 |  0.50 |  0.10 |  0.40 |  0.40 |                 |
| medium 128k   | few long     |  1.0 |  0.75 | 0.50 |  0.50 |  0.10 |  0.10 |  0.40 |                 |
| large 1MB     | few long     |  1.0 |  0.75 | 0.50 |  0.55 |  0.10 |  0.10 |  0.10 |                 |
| vlarge 4MB    | few long     |  1.0 |  0.50 | 0.50 |  0.55 |  0.10 |  0.10 |  0.10 |                 |

Note: All timings are normalized so that Solo = 1.0. A negative sign is purely a visual flag for values greater than 1 (i.e. slower than solo): -2.20 means 2.20.

WARNING: "2 work cores" means `-p3`, not `-p2`, because the first process just hangs around as the (nearly useless) master doling out tasks.

**Advice**

* Threads and `@distributed` processes both scale well now. The only time to use threads is when the memory footprint is very small. Otherwise, `@distributed` does just as well.

* `pmap` may be convenient, but it is often not particularly fast.

* Not shown: Unlike with threads, requesting more processes or tasks than there are physical CPUs on the same system is often harmless.

* Not shown: A few worker processes more than CPU cores can help keep the CPU busy. [Chris Rackauckas](http://www.stochasticlifestyle.com/236-2/) found similar effects. Adding a few more workers beyond this is mostly harmless; see the sketch below.
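
A hypothetical sketch of such a setup (the exact worker count is not part of the original benchmarks):

```julia
using Distributed

# A couple more workers than logical CPUs, which the advice above
# suggests is mostly harmless and can help keep the CPU busy.
addprocs(Sys.CPU_THREADS + 2)
nworkers()   # == Sys.CPU_THREADS + 2
```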


# Appendix

## Hardware and Software

The processor is an 8-core, 16-thread Xeon Skylake W-2140B with 64GB RAM and caches of 8x32KB (L1), 8x1MB (L2), and 8x1.375MB (L3). Julia:

```text
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Xeon(R) W-2140B CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
```

## Raw Output (Not Normalized to 1.0)

In the results below, the timings are raw (not normalized to the Solo column). As before, the columns split into the 2-worker and the 16-worker configurations.

^ **memory**   ^  **calls**   ^  **Solo**  ^  **Threads 2**  ^  **Distrib 2**  ^  **Pmap w/2**  ^  **Threads 16**  ^  **Distrib 16**  ^  **Pmap w/16**  ^
|    |    |    |  **2 Work Cores (-p3)**  |||  **16 Work Cores**  |||
| small 16k     | many short   |    4.6 |    3.7 |   2.3 |  10.1 |    0.8 |   1.3 |   6.0 |
| medium 128k   | many short   |   21.0 |   17.5 |  10.6 |  19.7 |   79.5 |   7.5 |  11.2 |
| large 1MB     | many short   |   49.5 |  125.0 |  37.6 |  76.6 |  122.5 |  14.6 |  37.3 |
| vlarge 4MB    | many short   |  143.0 |  549.3 |  88.2 | 108.1 |  328.1 |  67.6 |  64.7 |
| small 16k     | some medium  |    2.0 |    1.8 |   1.0 |   1.2 |    0.2 |   0.2 |   0.3 |
| medium 128k   | some medium  |    2.2 |    1.3 |   1.1 |   1.3 |    0.3 |   0.3 |   0.3 |
| large 1MB     | some medium  |    2.7 |    2.0 |   1.4 |   1.6 |    0.5 |   0.3 |   0.3 |
| vlarge 4MB    | some medium  |    3.5 |    9.0 |   1.8 |   2.0 |    2.0 |   1.4 |   0.7 |
| small 16k     | few long     |    2.0 |    1.0 |   1.0 |   1.0 |    0.2 |   0.8 |   0.8 |
| medium 128k   | few long     |    2.0 |    1.5 |   1.0 |   1.0 |    0.2 |   0.2 |   0.8 |
| large 1MB     | few long     |    2.0 |    1.5 |   1.0 |   1.1 |    0.2 |   0.2 |   0.2 |
| vlarge 4MB    | few long     |    2.0 |    1.0 |   1.0 |   1.1 |    0.2 |   0.2 |   0.2 |

* Runs on multiple CPUs have more variable performance. For example, the small-memory / medium-calls case produced timings ranging from 1.0 to 2.6 across runs, not just the 1.8 shown.
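
For reference, the normalized table at the top appears to be each raw row divided by its Solo entry, with ratios above 1 printed with a minus sign as a visual flag. A quick check on the small / many-short row:

```julia
# Check the normalization convention using the small/many-short raw row.
raw  = (solo = 4.6, threads2 = 3.7, distrib2 = 2.3, pmap2 = 10.1)
norm = map(x -> round(x / raw.solo, digits = 2), raw)
# norm == (solo = 1.0, threads2 = 0.8, distrib2 = 0.5, pmap2 = 2.2)
# In the normalized table, 2.2 (> 1, i.e. slower than solo) is printed as -2.20.
```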


## Benchmarking Program

This program (`parallel-test.jl`) is neither clever nor useful in itself; it exists only to explore relative parallel performance. Depending on the run, it is started either with worker processes (e.g. `julia -p 3 parallel-test.jl`) or with threads enabled via the `JULIA_NUM_THREADS` environment variable, but not both (see the assert in the code).

```julia
using Distributed, BenchmarkTools, Printf

#=
 Size guide (an i7 cache: L3 = 2MB, L2 = 256KB, L1 = 32KB; a Float64 vector
 of 2^18 elements = 2MB, 2^15 = 256KB, 2^12 = 32KB):
 ==> 2^11 for small.  2^14 for medium.  2^17 for large.  2^19 for verylarge.
=#

@everywhere const setmemoryuse= [ :small, :medium, :large, :verylarge ]
@everywhere const settimelength= [ :short, :medium, :long ]

@everywhere function testrun( memoryuse::Symbol, timelength::Symbol )::Float64

    @assert( memoryuse in setmemoryuse, "memoryuse cannot be $memoryuse" )
    @assert( timelength in settimelength, "timelength cannot be $timelength" )

    ## array length (in Float64 elements) and number of inner repetitions
    mmx= 2^( (memoryuse == :small) ? 11 : (memoryuse == :medium) ? 14 : (memoryuse == :large) ? 17 : 19 )
    rep= (timelength == :short) ? 1 : (timelength == :medium) ? 100 : 10000

    if (timelength == :short)
        ## in this case, most of the time is taken by the memory allocation, so adjust a little for it.
        (memoryuse == :verylarge) && ( rep/= 10.0 )
        (memoryuse == :large) && ( rep/= 5.0 )
        (memoryuse == :medium) && ( rep/= 2.0 )
    end#if

    @assert( mmx >= 2^11, "please call with a larger array" )
    circidx( i::Int, arrlen::Int )::Int= mod( i-1, arrlen ) + 1   ## wrap index around the array
    bvector= Vector{Float64}( collect(1:mmx) )  ## this can take a lot of time, in line with mmx, for short functions

    for j=1:rep
        for i=j:(j+1023)
            bvector[ circidx( i-1, mmx ) ]= sin( bvector[ circidx( i-1, mmx ) ] )
        end#for i#
    end#for j#
    bvector[1]

end;##function##


@assert( ((Threads.nthreads() == 1) || (nprocs() == 1)), "please decide on threads or nprocs, but not both" )

const SUPERPLAIN= false

(SUPERPLAIN) && @assert( ((Threads.nthreads() == 1) && (nprocs() == 1)), "please superplain or don't" )


for itimelength=1:length(settimelength)
    for imemoryuse=1:length(setmemoryuse)

        timelength= settimelength[ itimelength ]
        memoryuse= setmemoryuse[ imemoryuse ]

        ## keep total work roughly constant: more outer calls when each call is shorter
        repoutside= div( 10000, (timelength == :short) ? 1 : (timelength == :medium) ? 100 : 10000 )*16

        hdrstring= string( @sprintf("@ %10.10s ", memoryuse), @sprintf("%10.10s ", timelength), @sprintf("%02d/%02d :", nprocs(), Threads.nthreads()) )

        if (SUPERPLAIN)
            print("SOLO: $hdrstring\tthread:\t")
            @btime begin
                timelength= settimelength[ $itimelength ]
                memoryuse= setmemoryuse[ $imemoryuse ]
                for i=1:$repoutside; testrun( memoryuse, timelength ); end#for
            end#begin#

        elseif (Threads.nthreads() > 1)
            print("$hdrstring\tthread:\t")
            @btime begin
                timelength= settimelength[ $itimelength ]
                memoryuse= setmemoryuse[ $imemoryuse ]
                Threads.@threads for i=1:$repoutside; testrun( memoryuse, timelength ); end#for
            end#begin#

        else

            print("$hdrstring\tpmap df:\t")
            @btime begin
                memoryuse= setmemoryuse[ $imemoryuse ]
                timelength= settimelength[ $itimelength ]
                pmap( x->testrun( memoryuse, timelength ), 1:$repoutside )
            end#begin#

            print("$hdrstring\tdstrbtd:\t")
            @btime begin
                memoryuse= setmemoryuse[ $imemoryuse ]
                timelength= settimelength[ $itimelength ]
                @sync @distributed for i=1:$repoutside; testrun( memoryuse, timelength ); end#for
            end#begin#

        end#if#

        println()

    end#for
    println(" --\n")

end#for
```
  