The goal of this page is not to demonstrate the raw speed of the simulator, but rather the speed-up that can be obtained by using multiple cores, processors, and GPUs in different ways. You can run these simulations on your own machine (they are based on the examples that ship with the software) and see how your hardware compares to the machines we've tested.

| # | Simulation | Hardware | PSS/HSS configuration | PSS/HSS license requirement | Effective Cycle Time (s/cycle)* | Comment |
|---|---|---|---|---|---|---|
| A1 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | Box #1: 2x Opteron 285, 16GB DDR400 | 1x 1-threaded-PSS | 1/0 | 187 | Single-core (i.e. no SimRunner). |
| A2 | " | " | 1x 4-threaded-PSS | 4/0 | 95 | 4 cores give 2X speedup with multi-threading (4 cores working on one job). |
| A3 | " | " | 2x 1-threaded-PSS | 2/0 | 104 | 2 cores give almost 2X speedup with job distribution (2 cores working on two jobs independently). This is always more efficient than multi-threading, but requires more memory. |
| A4 | " | " | 2x 2-threaded-PSS | 4/0 | 61 | Combining multi-threading and job distribution seems optimal: 4 cores give 3X speedup, but memory for two simulations is required. This seems reasonable on the AMD dual-core architecture, where each processor (pair of cores) has its own memory controller and "close" memory. |
| A5 | " | " | 1x SuperPSS{4x 1-threaded PSS} | 4/0 | 64 | Almost 3X speedup with 4 cores, but uses less memory than #A4. Much faster than #A2. |
| A6 | " | Box #1: 2x Opteron 280, 16GB DDR 400; Box #2: 2x Opteron 270, 16GB DDR 400 | 1x SuperPSS{8x 1-threaded PSS} | 8/0 | 60 | Not much faster than #A4 or #A5. Uses less memory per machine than #A4. |
| A7 | " | " | 2x SuperPSS{2x 2-threaded-PSS} | 8/0 | 49 | The Opteron 270 machine is slower. If both machines were Opteron 285s, then we would expect double the performance of #A4. |
| A8 | " | Box #1: 2x Tesla C870 | 1x 2-GPU-HSS | 0/2 | 18 | Simulation fits entirely within two cards. |
| A9 | " | Box #1: 1x Tesla C870 | 1x 1-GPU_HSS | 0/1 | 29 | More than 2X faster than #A4 (1 HSS license vs. 4 PSS licenses). |
| A10 | " | Box #1: 2x Tesla C870; Box #2: 2x Tesla C870 | 2x 2-GPU-HSS | 0/4 | 9 | Double the performance of #A8 (running two cases at once). |
| A11 | " | " | 1x SuperPSS{2x 2-GPU-HSS} | 0/4 | 43 | Bad performance because of communication overhead for SuperPSS. |
| B1 | AltPSM_Contacts with pitch=2.2 (9.1GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU_HSS | 0/1 | 162 | The DDR 400 memory was slowed to 366MHz. |
| B2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1-GPU_HSS | 0/1 | 135 | This machine has faster memory than the one in B1. |
| B3 | " | " | 2x SuperPSS{4x 1-threaded-PSS} | 8/0 | 288 | Using all 8 cores is slower than 1x Tesla C870 on the same machine (see B2). |
| B4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 2-GPU_HSS | 0/2 | 101 | Two C870s give 101 s versus 162 s for one C870. As expected, this is not a 2X speedup, but it is still a decent one (162/101 = 1.6X). |
| B5 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 8-threaded PSS | 8/0 | 326 | See B3. |
| B6 | " | " | 2x SuperPSS{2x 2-threaded-PSS} | 8/0 | 314 | See B5 & B3. |
| B7 | " | " | 1x SuperPSS{1x 8-threaded-PSS, 1x 1-GPU_HSS} | 8/1 | 287 | Better to use the HSS alone: the PSSs do not help it, they only slow it down. See B2. |
| C1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x 8800 GT-OC | 1x 1-GPU_HSS | 0/1 | 1.57 | This is just a graphics card (8800 GT-OC) with 512MB GDDR3 memory. The card was driving video during the simulation, so it might be a bit faster without video. |
| C2 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | " | 0/1 | 1.29 | Compare to C1. The Tesla C870 beats the less expensive 8800 GT-OC even for a small simulation that fits entirely within the card's memory. |
| C3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | 1x 1-threaded-PSS | 1/0 | 9.90 | Tesla C870 is 7.67X faster than a single core of the Intel 5440. The 8800 GT-OC is only 6.3X faster. |
| C4 | " | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | " | 1/0 | 9.90 | Older Opteron 285 runs at the same speed as the newer Intel 5440!? |
| C5 | " | " | 1x 1-GPU_HSS | 0/1 | 1.35 | Tesla C870 on Opteron 285 with 366MHz DDR is slower than Tesla C870 on Intel 5440 with DDR2 667MHz (expected). |
| D1 | AltPSM_Contacts, pitch=0.8 (1.2GB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU_HSS | 0/1 | 9.9 | Simulation fits entirely within the Tesla C870's 1.5GB memory. |
| D2 | " | " | 1x 1-threaded-PSS | 1/0 | 143 | Compare to D1. Here the Tesla C870 is 14X faster than the Opteron 285 processor. This is the “sweet spot” for the C870 because the simulation is large, but still fits inside the card. |
| D3 | " | Box #2: 2x Intel 5440, 32GB DDR2 667 | " | 1/0 | 83 | Here we see the newer Intel 5440/DDR2 667MHz beating the older Opteron 285/DDR 366MHz (expected). |
| D4 | " | Box #2: 2x Intel 5440, 32GB DDR2 667, 1x C870 | 1x 1-GPU_HSS | 0/1 | 8.7 | Here we see a 9.5X speed-up compared to the late-model Intel 5440 processor. Note that this cycle time is faster than the C870 on the older Opteron machine (D1), so the host system does matter. |
| E1 | AltPSM_Contacts, pitch=0.3 (169MB) | Box #1: 2x Opteron 285, 16GB DDR366, 2x C870 | 1x 1-GPU_HSS | 0/1 | 1.35 | Compare with E1a. |
| E1a | " | " (but with 2x C1060) | " | " | 0.67 | Compare with E1: the C1060 has 2X the processing power of the C870. |
| E2 | " | " (but with 2x C870) | 1x 2-GPU_HSS | 0/2 | 1.42 | As expected, there is no improvement from using more cards on a small simulation that fits within one card (compare to E1). |
| E2a | " | " (but with 2x C1060) | " | " | 0.65 | Basically the same as E1a. |
| E3 | " | " (but with 2x C870) | 2x 1-GPU_HSS | 0/2 | 0.68 | Running two simulations at the same time; compare to E1. |
| E3a | " | " (but with 2x C1060) | " | " | 0.36 | Running two simulations at the same time; compare to E1a. |
| E4 | Elbow.sim, 3GB, 3D EUV with Fourier Boundary Condition, non-complex | " (but with 2x C870) | 2x 1-GPU_HSS | 0/2 | 17.5 | Compare with E4a. |
| E4a | " | " (but with 2x C1060) | " | " | 8.25 | C1060 is more than 2X faster than the C870; compare with E4. |
| E5 | " | " (but with 2x C870) | 1x 2-GPU_HSS | 0/2 | 18.4 | Compare with E5a. |
| E5a | " | " (but with 2x C1060) | " | " | 14.8 | The modest improvement over the C870 is expected: the simulation fits within the first C1060, so the second card is not used at all. In E5 both C870s run at the same time; here (E5a) only one card runs while the other sits idle. |
| E6 | Elbow.sim, with 6 degree incidence (complex simulation) and pitch=76nm, 10GB, 3D EUV with Fourier Boundary Condition | " (but with 2x C870) | 1x 2-GPU_HSS | 0/2 | 92 | Domain divided into 7 parts: the first 6 parts run in simultaneous pairs, and the 7th part runs on one card while the other remains idle. Card utilization is 7/8 = 87.5% (excluding CPU memory transfer overhead). |
| E6a | " | " (but with 2x C1060) | " | " | 67 | Domain divided into 3 parts: the first 2 parts run simultaneously, and the 3rd part runs on one card while the other remains idle. Card utilization is 3/4 = 75% (excluding CPU memory transfer overhead). There is no 2X speedup over E6 because GPU utilization is lower and the CPU transfer overhead might be large, especially since the box has DDR 366 (not even DDR2) and only PCI Express x16 generation 1 (not generation 2.0). With PCI Express x16 (gen 2) and DDR2-800, the improvement would probably be closer to 2X. |
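
The card-utilization figures quoted in the E6 and E6a comments can be reproduced with a few lines of Python. This is only a sketch: it assumes every domain part takes the same time on a card and that parts run in lock-step rounds, and it ignores CPU memory transfer overhead, as the comments do. The function name `card_utilization` is made up for this illustration and is not part of the simulator.

```python
import math


def card_utilization(num_parts: int, num_cards: int) -> float:
    """Fraction of card time slots that do useful work.

    Assumes equal-sized parts run in rounds of up to `num_cards` parts,
    with no transfer overhead (an idealization).
    """
    if num_parts < 1 or num_cards < 1:
        raise ValueError("num_parts and num_cards must be positive")
    rounds = math.ceil(num_parts / num_cards)  # the last round may be partly idle
    return num_parts / (rounds * num_cards)


e6_utilization = card_utilization(num_parts=7, num_cards=2)   # E6: 7 parts, 2 cards
e6a_utilization = card_utilization(num_parts=3, num_cards=2)  # E6a: 3 parts, 2 cards

print(f"E6:  {e6_utilization:.1%}")
print(f"E6a: {e6a_utilization:.1%}")
```

Under these assumptions an even number of parts on two cards would give 100% utilization, which is why the odd part counts in E6 and E6a leave one card idle in the final round.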

*Note: "Effective" cycle time is the total cycle time divided by the number of cases running. For example, if you have 5 PSSs running 5 different simulations (of the same size) and each has a cycle time of 10s, then the effective cycle time is 10s/5=2s. A "cycle" is the amount of time TEMPESTpr2 takes to propagate the fields one wavelength.
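
If you want to compare your own measurements against the table, the arithmetic in the note and in the comments can be sketched as below. The helper names `effective_cycle_time` and `speedup` are hypothetical and exist only for this example; the numbers are taken from the note and from rows B1/B4 and D3/D4.

```python
def effective_cycle_time(cycle_time_s: float, num_cases: int) -> float:
    """Cycle time divided by the number of same-size cases running at once."""
    if num_cases < 1:
        raise ValueError("num_cases must be at least 1")
    return cycle_time_s / num_cases


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    if candidate_s <= 0:
        raise ValueError("candidate_s must be positive")
    return baseline_s / candidate_s


# Example from the note: 5 PSSs, each with a 10 s cycle time.
note_example = effective_cycle_time(10.0, 5)

# B1 (one C870, 162 s) versus B4 (two C870s, 101 s).
b4_vs_b1 = speedup(162, 101)

# D3 (one Intel 5440 core, 83 s) versus D4 (one C870, 8.7 s).
d4_vs_d3 = speedup(83, 8.7)

print(f"note example: {note_example:.1f} s/cycle")
print(f"B4 vs B1: {b4_vs_b1:.1f}X")
print(f"D4 vs D3: {d4_vs_d3:.1f}X")
```

Dividing a speedup by the number of cores or cards used gives a rough parallel efficiency, which is a convenient way to compare rows that use different amounts of hardware.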