John Hiller Executive Sales & Technology Consultant

FALCON ML CONSULTING - MORE ACCURATE DATA CENTER / STOCK TRADE

JOHN HILLER SUPERCOMPUTER SOFTWARE PATENT

Method for Automated Deployment of a Software Program onto a Multi-Processor Architecture

  • United States Patent 5418953


  • The method is employed for pre-assignment & scheduling of tasks, enabling allocation across multiple physical processors arranged in a variety of architectures. The assignment attempts to arrive at a minimal cost value for all tasks comprising the problem (a short sketch follows).
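
The sketch below illustrates the general idea of cost-driven task pre-assignment, assuming a simple greedy strategy and made-up task costs; it is an illustration of the concept, not the patented method itself.

    def assign_tasks(task_costs, num_processors):
        # Greedy pre-assignment: largest task first onto the least-loaded processor,
        # so the estimated cost across all processors stays as low as possible.
        loads = [0.0] * num_processors
        schedule = {p: [] for p in range(num_processors)}
        for task, cost in sorted(enumerate(task_costs), key=lambda t: -t[1]):
            p = min(range(num_processors), key=lambda i: loads[i])
            schedule[p].append(task)
            loads[p] += cost
        return schedule, loads

    # Example: eight tasks with estimated costs, spread across four processors.
    print(assign_tasks([5, 3, 8, 2, 7, 4, 6, 1], 4))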

JOHN HILLER CORNERSTONE SUPERCOMPUTER ARCHITECTURE PATENT

Highly parallel computer architecture employing crossbar switch with selectable pipeline delay

  • United States Patent 5081575


  • A crossbar switch which connects N (N = 2^k; k = 0, 1, 2, 3) coarse grain processing elements (rated at 20 million floating point operations per second) to a plurality of memories provides for a parallel processing system free of memory conflicts over a wide range of arithmetic computations (i.e. scalar, vector and matrix). The configuration of the crossbar switch, i.e., the connection between each processing element unit and each parallel memory module, may be changed dynamically on a cycle-by-cycle basis in accordance with the requirements of the algorithm under execution.
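
The toy model below illustrates cycle-by-cycle crossbar reconfiguration, assuming one conflict-free permutation of processing-element-to-memory connections per cycle; the permutation schedule shown is illustrative only, not the patented control logic.

    def crossbar_cycle(pe_outputs, permutation):
        # Each cycle maps processing element i to one memory module, conflict-free
        # because the mapping is a permutation (no two PEs share a memory).
        assert sorted(permutation) == list(range(len(pe_outputs)))
        memories = [None] * len(pe_outputs)
        for pe, mem in enumerate(permutation):
            memories[mem] = pe_outputs[pe]
        return memories

    # Cycle-by-cycle reconfiguration for N = 4 processing elements.
    data = ["a", "b", "c", "d"]
    print(crossbar_cycle(data, [0, 1, 2, 3]))   # identity mapping this cycle
    print(crossbar_cycle(data, [1, 2, 3, 0]))   # rotated mapping next cycle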


PAST: JOHN HILLER SUPERCOMPUTER BROKE WORLD RECORDS FOR PERFORMANCE IN DARPA, LINPACK

  • Past: Oryx High Performance Supercomputer System capabilities directly applicable to GPU AI performance



  • Multiple processing elements containing floating point, fixed point & logical units on computational nodes (2, 4, 8, etc.) connected to multiple memories (2, 4, 8, etc.) via a crossbar switch
     
  • Scalar, vector, & matrix computation capabilities


  • Fully connected Parallel Crossbar Switch free of memory access conflicts over a wide range of applications
     
  • High & effective memory transfer rates
     
  • Data mapped to parallel memories that are readily accessible for input/output & matrix processing


  • This supercomputer architecture and the math running on it provide a method for comparing 2025 graphics processing unit (GPU) systems (selecting among Nvidia, AMD, or Intel) to gain high performance for the latest artificial intelligence applications, as sketched below.
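
One way to make such a comparison concrete is a simple bandwidth-versus-compute bound. The sketch below uses placeholder GPU names and illustrative spec figures (not vendor data) purely to show the form of the estimate.

    def estimate_runtime_us(bytes_moved, flops, mem_gb_s, peak_gflops):
        # Lower bound: a kernel is limited by either memory traffic or compute.
        t_mem = bytes_moved / (mem_gb_s * 1e3)      # microseconds
        t_compute = flops / (peak_gflops * 1e3)     # microseconds
        return max(t_mem, t_compute)

    candidates = {                                  # hypothetical spec entries
        "GPU_A": {"mem_gb_s": 1555, "peak_gflops": 19500},
        "GPU_B": {"mem_gb_s": 1600, "peak_gflops": 23000},
    }
    workload = {"bytes_moved": 786_432, "flops": 33_554_432}  # 256 x 256 matmul-sized
    for name, spec in candidates.items():
        print(name, estimate_runtime_us(workload["bytes_moved"],
                                        workload["flops"], **spec))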

NVIDIA STREAMING MULTIPROCESSOR GA100

  • 128 (2^7) SMs per GPU
  • 32 (2^5) CUDA cores per SM
  • 4,096 (2^12) total CUDA cores per GPU
  • Two FP32 mult or add units per CUDA core
  • 64 FP32 ops per SM
  • One FP64 mult or add unit per CUDA core
  • 32 FP64 ops per SM
  • 40 GB HBM2 total, 6 device memories per GPU
  • 40 MB L2 cache total, 6 L2s per GPU
  • 128 L1 data caches per GPU, 192 KB per L1
  • 8 KB per CUDA core, in 4 register files
  • 1.2 GHz clock
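
A quick check of the totals implied by the figures above; the peak FP32 rate is an estimate derived from these listed numbers, not a vendor specification.

    sm_per_gpu = 128          # 2^7
    cores_per_sm = 32         # 2^5
    fp32_per_core = 2         # two FP32 mult or add per CUDA core, per clock
    clock_hz = 1.2e9          # 1.2 GHz

    total_cores = sm_per_gpu * cores_per_sm             # 4,096 = 2^12
    fp32_per_sm = cores_per_sm * fp32_per_core          # 64
    peak_fp32_gflops = total_cores * fp32_per_core * clock_hz / 1e9
    print(total_cores, fp32_per_sm, peak_fp32_gflops)   # 4096 64 9830.4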

NVIDIA GA100 GPU block diagram: 128 SMs, 128 L1 data caches.

256 x 256 Matrix Multiply on Nvidia GA100 GPU Efficiencies

  • Dense memories to L1 transfer: 27,306/41,130 = 66.4%
  • L1 to CUDA core register file (RF) transfer: 8,448/41,130 = 20.5%
  • CUDA core FP32 unit 0: 4,096/41,130 = 10.0%
  • CUDA core FP32 unit 1: (4,096 + 256)/41,130 = 5,120/41,130 = 12.4%
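
These percentages follow directly from the clock counts stated on this page; a quick check:

    total_clocks = 41_130
    stages = {
        "dense memories to L1": 27_306,
        "L1 to CUDA core register files": 8_448,
        "CUDA core FP32 unit 0": 4_096,
        "CUDA core FP32 unit 1": 4_096 + 256,
    }
    for name, clocks in stages.items():
        print(f"{name}: {100 * clocks / total_clocks:.1f}%")   # 66.4, 20.5, 10.0, 12.4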

256 x 256 Matrix Multiply on Nvidia GA100 GPU Host to Host Runtimes

  • Host PCIe Gen 4 to 6 device memories: 32 GB/s transfer of 786,432 bytes = 25 microseconds
  • 6 device memories to 6 L2s: 177 GB/s transfer of 786,432 bytes = 4.5 microseconds
  • 6 L2s to 128 L1s: 280 GB/s transfer of 786,432 bytes = 2.8 microseconds
  • L1 to 32 CUDA cores: 10.6 GB/s transfer of 137,216 bytes = 13.0 microseconds
  • CUDA core computation at the 1.2 GHz clock (0.833 ns per clock) for 13,824 clocks = 11.5 microseconds
  • Total runtime = 56 microseconds plus software overhead
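
The stage times follow from the listed byte counts and transfer rates; a quick check with rounded values:

    def transfer_us(nbytes, gb_per_s):
        # bytes moved at a sustained GB/s rate, reported in microseconds
        return nbytes / (gb_per_s * 1e9) * 1e6

    stages_us = [
        transfer_us(786_432, 32),      # host PCIe Gen 4 to 6 device memories
        transfer_us(786_432, 177),     # 6 device memories to 6 L2s
        transfer_us(786_432, 280),     # 6 L2s to 128 L1s
        transfer_us(137_216, 10.6),    # L1 to 32 CUDA cores
        13_824 * 0.833e-3,             # compute clocks at 0.833 ns each, in microseconds
    ]
    print([round(t, 1) for t in stages_us], round(sum(stages_us)))   # per-stage times, ~56 us total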

256 x 256 Matrix Multiply on Nvidia GA100 GPU Runtimes in Clock Cycles

  • 27,306 clocks: 5,461 clocks to move the A matrix, evenly distributed across the 6 L2s (each A row is sent to a pair of L1s; 64 pairs of L1s hold the A matrix); 16,384 clocks to move one half of the B matrix into 2 L2s (one L2 broadcasts to 64 of the 128 L1s, the other L2 broadcasts to the other 64 L1s); data in & out of dense memories, L2s, and L1s; 5,461 clocks to move the C matrix back to the dense memories via the L2s
  • 13,824 clocks: 4 passes (1,024 C matrices per pass)
  • 41,130 clocks total
  • 256 x 256 C matrix in 6 dense memories tied to the host bus
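
Summing the stated clock counts reproduces the 41,130-clock total:

    a_in, b_in, c_out = 5_461, 16_384, 5_461        # A in, B in, C out
    data_movement = a_in + b_in + c_out             # 27,306 clocks
    compute = 13_824                                # 4 passes of the multiply itself
    print(data_movement, data_movement + compute)   # 27306, 41130 clocks total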

2048 x 2048 2D Convolution on Nvidia GA100 GPU Efficiencies

  • Dense memories to L1 transfer: 174,762/188,458 = 92.7%
  • L1 to CUDA core transfer: 2,528/188,458 = 1.3%
  • ALUs: 11,168/188,458 = 5.9%
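
The same ratio check for the convolution case, using the sum of the three stated clock counts as the denominator:

    total_clocks = 188_458                        # 174,762 + 2,528 + 11,168
    for stage, clocks in [("dense memories to L1", 174_762),
                          ("L1 to CUDA core", 2_528),
                          ("ALUs", 11_168)]:
        print(f"{stage}: {100 * clocks / total_clocks:.1f}%")   # 92.7, 1.3, 5.9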

2048 x 2048 2D Convolution on Nvidia GA100 GPU Host to Host Runtimes

  • Host PCIe Gen 4 to 6 device memories: 32 GB/s transfer of 8,388,608 bytes = 262 microseconds
  • 6 device memories to 6 L2s: 177 GB/s transfer of 8,388,608 bytes = 47 microseconds
  • 6 L2s to 128 L1s: 280 GB/s transfer of 8,388,608 bytes = 30 microseconds
  • L1 to 32 CUDA cores: 10.6 GB/s transfer of 65,536 bytes = 6.2 microseconds
  • CUDA core computation at the 1.2 GHz clock (0.833 ns per clock) for 11,168 clocks = 9.3 microseconds
  • Total runtime = 354 microseconds plus software overhead
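
As above, the stage times follow from the byte counts and transfer rates; a quick check:

    def transfer_us(nbytes, gb_per_s):
        # bytes moved at a sustained GB/s rate, reported in microseconds
        return nbytes / (gb_per_s * 1e9) * 1e6

    stages_us = [
        transfer_us(8_388_608, 32),    # host PCIe Gen 4 to 6 device memories
        transfer_us(8_388_608, 177),   # 6 device memories to 6 L2s
        transfer_us(8_388_608, 280),   # 6 L2s to 128 L1s
        transfer_us(65_536, 10.6),     # L1 to 32 CUDA cores
        11_168 * 0.833e-3,             # compute clocks at 0.833 ns each, in microseconds
    ]
    print(round(sum(stages_us)))       # ~355 us, consistent with the ~354 us total above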

2048 x 2048 2D Convolution on Nvidia GA100 GPU Runtimes in Clock Cycles

  • 174,762 clocks: movement of the input & output matrices, evenly distributed across the 6 L2s; 128 (16 x 16) matrices to/from each of the 128 L1s
  • 13,440 clocks: 32 passes (128 (16 x 16) matrices per pass); 352 compute clocks per pass plus 68 clocks for data to/from the L1s and CUDA cores = 420 clocks per pass
  • 188,202 clocks total
  • 2048 x 2048 matrix in 6 dense memories tied to the host bus
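
Summing the stated per-pass and data-movement clocks reproduces the 188,202-clock total:

    per_pass = 352 + 68            # compute clocks + L1/CUDA-core data clocks
    compute = 32 * per_pass        # 13,440 clocks over 32 passes
    data_movement = 174_762        # tile traffic through the 6 L2s and 128 L1s
    print(per_pass, compute, data_movement + compute)   # 420, 13440, 188202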

John Hiller Sales & Technology Consultant

294B Shore Drive (1041) Montague, NJ 07827, USA

+1.8456725431

Copyright © 2026 John Hiller Sales & Technology Consultant - All Rights Reserved.
