The goal of this page is not to demonstrate the raw speed of the simulator, but rather the speedup that can be obtained by using multiple cores, processors, and GPUs in different ways. You can run these simulations on your own machine (they are based on the examples that ship with the software) and see how your hardware compares to the machines we've tested.
| # | Simulation | Hardware | PSS/HSS configuration | PSS/HSS license requirement | Effective Cycle Time (s/cycle)^{*} | Comment |
|---|---|---|---|---|---|---|
| A1 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | Box #1: 2x Opteron 285, 16GB DDR400 | 1x 1threadedPSS | 1/0 | 187 | Single-core (i.e. no SimRunner) |
| A2 | " | " | 1x 4threadedPSS | 4/0 | 95 | 4 cores give 2X speedup with multithreading (4 cores working on one job). |
| A3 | " | " | 2x 1threadedPSS | 2/0 | 104 | 2 cores give almost 2X speedup with job distribution (2 cores working on two jobs independently). This is always more efficient than multithreading, but requires more memory. |
| A4 | " | " | 2x 2threadedPSS | 4/0 | 61 | Combination of multithreading and job distribution seems optimal: 4 cores give 3X speedup, but it requires memory for two simulations. Seems reasonable on the AMD dual-core architecture, where each processor (pair of cores) has its own memory controller and "close" memory. |
| A5 | " | " | 1x SuperPSS {4x 1threaded PSS} | 4/0 | 64 | Almost 3X speedup with 4 cores, but uses less memory than #A4. Much faster than #A2. |
| A6 | " | Box #1: 2x Opteron 280, 16GB DDR 400; Box #2: 2x Opteron 270, 16GB DDR 400 | 1x SuperPSS {8x 1threaded PSS} | 8/0 | 60 | Not much faster than #A4 or #A5. Uses less memory per machine than #A4. |
| A7 | " | " | 2x SuperPSS{2x 2threadedPSS} | 8/0 | 49 | The Opteron 270 machine is slower. If both machines were Opteron 285s, then we would expect double the performance of #A4. |
| A8 | " | Box #1: 2x Tesla C870 | 1x 2GPUHSS | 0/2 | 18 | Simulation fits entirely within two cards. |
| A9 | " | Box #1: 1x Tesla C870 | 1x 1GPU_HSS | 0/1 | 29 | More than 2X faster than #A4 (1 HSS license vs. 4 PSS licenses). |
| A10 | " | Box #1: 2x Tesla C870; Box #2: 2x Tesla C870 | 2x 2GPUHSS | 0/4 | 9 | Double the performance of #A8 (running two cases at once). |
| A11 | " | " | 1x SuperPSS{2x 2GPUHSS} | 0/4 | 43 | Bad performance because of communication overhead for SuperPSS. |
| B1 | AltPSM_Contacts with pitch=2.2 (9.1GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1GPU_HSS | 0/1 | 162 | The DDR 400 memory was slowed to 366MHz. |
| B2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1GPU_HSS | 0/1 | 135 | This machine has faster memory than the one in B1. |
| B3 | " | " | 2x SuperPSS{4x 1threadedPSS} | 8/0 | 288 | Using all 8 cores is slower than 1x Tesla C870 on the same machine (see B2). |
| B4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 2GPU_HSS | 0/2 | 101 | Using 2 C870s instead of 1 C870 reduces the cycle time from 162s to 101s. So we don't get a 2X speedup (as expected), but we do get a decent one (162/101 = 1.6X). |
| B5 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 8threaded PSS | 8/0 | 326 | See B3. |
| B6 | " | " | | 8/0 | 314 | See B5 & B3. |
| B7 | " | " | 1x SuperPSS{1x 8threadedPSS, 1x 1GPU_HSS} | 8/1 | 287 | Better to just use the HSS alone. The PSSs can't help it; they just slow it down. See B2. |
| C1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x 8800 GTOC | 1x 1GPU_HSS | 0/1 | 1.57 | This is just a graphics card (8800 GTOC) with 512MB GDDR3 memory. The card was driving video during the simulation (it might be a bit faster without video). |
| C2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | " | 0/1 | 1.29 | Compare to C1. The Tesla C870 beats the less expensive 8800 GTOC even for a small simulation that fits entirely within the card's memory. |
| C3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | 1x 1threadedPSS | 1/0 | 9.90 | Tesla C870 is 7.67X faster than a single core of the Intel 5440. The 8800 GTOC is only 6.3X faster. |
| C4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | " | 1/0 | 9.90 | Older Opteron 285 same speed as newer Intel 5440!? |
| C5 | " | " | 1x 1GPU_HSS | 0/1 | 1.35 | Tesla C870 on Opteron 285 with 366MHz DDR is slower than Tesla C870 on Intel 5440 with DDR2 667MHz (expected). |
| D1 | AltPSM_Contacts, pitch=0.8 (1.2GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1GPU_HSS | 0/1 | 9.9 | Simulation fits entirely within the Tesla C870's 1.5GB memory. |
| D2 | " | " | 1x 1threadedPSS | 1/0 | 143 | Compare to D1. Here the Tesla C870 is 14X faster than the Opteron 285 processor. This is the "sweet spot" for the C870 because the simulation is large, but still fits inside the card. |
| D3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | " | 1/0 | 83 | Here we see the newer Intel 5440/DDR2 667MHz beating the older Opteron 285/DDR 366MHz (expected). |
| D4 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1GPU_HSS | 0/1 | 8.7 | Here we see a 9.5X speedup compared to the late-model Intel 5440 processor. Note that this cycle time is faster than the C870 on the older Opteron machine (D1), so the host system does matter. |
| E1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1GPU_HSS | 0/1 | 1.35 | Compare with E1a. |
| E1a | " | " (but with 2x C1060) | " | " | 0.67 | Compare with E1. The C1060 has 2X the processing power of the C870. |
| E2 | " | " (but with 2x C870) | 1x 2GPU_HSS | 0/2 | 1.42 | As expected, no improvement when using more cards on a small simulation that fits within one card (compare to E1). |
| E2a | " | " (but with 2x C1060) | " | " | 0.65 | Basically the same as E1a. |
| E3 | " | " (but with 2x C870) | 2x 1GPU_HSS | 0/2 | 0.68 | Running two simulations at the same time. Compare to E1. |
| E3a | " | " (but with 2x C1060) | " | " | 0.36 | ". Compare to E1a. |
| E4 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | " (but with 2x C870) | 2x 1GPU_HSS | 0/2 | 17.5 | Compare with E4a. |
| E4a | " | " (but with 2x C1060) | " | " | 8.25 | C1060 is more than 2X faster than the C870. Compare with E4. |
| E5 | " | " (but with 2x C870) | 1x 2GPU_HSS | 0/2 | 18.4 | Compare with E5a. |
| E5a | " | " (but with 2x C1060) | " | " | 14.8 | The modest improvement over the C870 is expected because the 2nd card is not utilized at all, as the simulation fits within the first card. In E5 both C870s run at the same time; here (E5a) only one card runs while the other sits idle. |
| E6 | Elbow.sim, with 6 degree incidence (complex simulation) and pitch=76nm, 10GB, 3D EUV with Fourier Boundary Condition | " (but with 2x C870) | 1x 2GPU_HSS | 0/2 | 92 | Domain divided into 7 parts: the first 6 parts run in simultaneous pairs, and the 7th part runs on one card while the other remains idle. Card utilization is 7/8 = 87.5% (excluding CPU memory transfer overhead). |
| E6a | " | " (but with 2x C1060) | " | " | 67 | Domain divided into 3 parts: the first 2 parts run simultaneously, and the 3rd part runs on one card while the other remains idle. Card utilization is 3/4 = 75% (excluding CPU memory transfer overhead). The reason there is no 2X speedup over E6 is that GPU utilization is lower and CPU transfer overhead might be large, especially since the box has DDR 366 (not even DDR2) and only PCI Express x16 generation 1 (not generation 2.0). With PCI Express x16 (gen 2) and DDR2-800, the improvement would probably be closer to 2X. |
^{*}Note: "Effective" cycle time is the total cycle time divided by the number of cases running. For example, if you have 5 PSSs running 5 different simulations (of the same size) and each has a cycle time of 10s, then the effective cycle time would be 10s/5=2s. A "cycle" is the amount of time TEMPESTpr2 takes to propagate the fields one wavelength.
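The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration and not part of the simulator: the function names are made up for this page, and the utilization helper assumes that every domain part takes the same time and occupies one card for one round, which is the simplification the E6/E6a comments use.

```python
import math


def effective_cycle_time(cycle_time_s, n_cases):
    """Cycle time of one case divided by the number of cases running at once."""
    return cycle_time_s / n_cases


def speedup(baseline_s, measured_s):
    """How many times faster 'measured' is than 'baseline' (ratio of cycle times)."""
    return baseline_s / measured_s


def card_utilization(n_parts, n_cards):
    """Fraction of card time slots doing work (hypothetical helper).

    Assumes equal-sized parts that are processed in rounds of n_cards parts;
    the last round may leave some cards idle. Transfer overhead is ignored.
    """
    rounds = math.ceil(n_parts / n_cards)
    return n_parts / (rounds * n_cards)


if __name__ == "__main__":
    # Footnote example: 5 cases, 10 s/cycle each.
    print(effective_cycle_time(10, 5))        # 2.0
    # A1 (187 s) vs. A2 (95 s) and A4 (61 s).
    print(round(speedup(187, 95), 2))         # 1.97
    print(round(speedup(187, 61), 2))         # 3.07
    # E6: 7 parts on 2 cards; E6a: 3 parts on 2 cards.
    print(card_utilization(7, 2))             # 0.875
    print(card_utilization(3, 2))             # 0.75
```

To compare your own hardware, substitute your measured cycle times for the values taken from rows A1, A2 and A4.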