John Hiller, Executive Sales & Technology Consultant

SUPERCOMPUTER ARTIFICIAL INTELLIGENCE CONSULTING

Falcon ML Consulting Competitive Advantage for Investment Banks

  • Top Investment Banking Competitive Advantages Using Machine Learning (ML)


  • Machine Learning (ML) is a technological breakthrough for the financial industry: millisecond improvements in decision making translate into more profitable trading decisions


  • Developing ML solutions for the Financial Industry is one of the riskiest, most costly and time-sensitive tasks facing this industry. 


  • Developing new ML applications for the Financial Industry & running these applications with high performance architectures requires experienced practitioners with stellar track records.

Competitive Success – Senior Executive Page 1


  • Situation: CEO/President of a large investment firm


  • Needs: Projected yearly revenue and profit growth of 30%; a competitive advantage that improves trade decisions and accuracy; reductions in R&D costs, risks, and schedules


  • Reason: The electronic trading staff is not keeping up with advances in supercomputer hardware/software architecture and artificial intelligence mathematics. R&D planning and tracking are poor, costs are high, integration schedules are lengthy, and less funding is available for innovation projects

Competitive Success – Senior Executive Page 2


  • FALCON ML Consulting Solution:


  • A High Performance Computing (HPC) consulting company descended from the Oryx Corporation


  • Oryx Corp. Founder: John Hiller


  • Dr. Narendra Ahuja – Artificial Intelligence & Computer Science (PhD advisor: Azriel Rosenfeld), University of Illinois; student advisees: 60 PhD, 20 MS, 100 undergraduate researchers, 14 postdocs; developed the Computer Vision & Robotics Lab at the Beckman Institute (research)


  • Oryx Corp. developed both commercial and militarized supercomputers (HPC)


  • $19M in funding from Eastman Kodak, venture capital, Grumman Corp., and Lockheed Martin


  • One of the earliest developers/users of graphical user interface programming software for mathematics


  • Used by DARPA for the first successful Autonomous Land Vehicle (AI)


  • Oryx Corp. Accomplishments:  


  • Designed/Completed/Tested a Crossbar-based Supercomputer within 18 months


  • Experience developing and testing a rich library of parallel math algorithms used in sonar, image, and radar processing applications that require high computational and memory bandwidth, directly applicable to AI and ML:


  • Linear Algebra


  • Neural Net


  • Vision Processing, Optimization

Competitive Success – Senior Executive Page 3


  • Results:


  • Rapid entry into the AI/ML market arena
  • Faster stock trades and better stock trade accuracy
  • Reduced risks in R/D
  • Reduced costs in R/D
  • Reduced schedules in R/D
      

PAST: Oryx Corp. Supercomputer Was Built & Tested Using Advanced Algorithms from Linpack, DARPA


  • Sponsors: Kodak, Grumman, General Electric, Martin Marietta, Unisys, Lockheed, Raytheon, etc.
  • Applications: Radar, sonar, image processing, electronic warfare, autonomous land vehicle, communications intercept, etc.
  • Founder: John Hiller (Founder, CEO, CTO)
  • Consultants: Dr. Azriel Rosenfeld (father of image processing), University of Maryland; lead architect of the DARPA benchmark suite – Hough transform (finding straight lines in an image), traveling salesman, etc.
  • Products: The Oryx SSP R&D computer architecture (hardware & software) broke world records on DARPA and Linpack algorithms – directly applicable to the millisecond speed needed to make better trading decisions efficiently.

PAST: Oryx Funded by a $3.6M R&D Contract from Eastman Kodak Corporate, Followed by a $15M Venture Round A

  • Oryx initial development funding ($3.6M): all technical milestones (hardware, math algorithms, and flowgraph editor) demonstrated on a parallel processor within an 18-month schedule
  • Oryx second-round funding of $15M from top venture firms in the USA:
  • New Enterprise Associates, Oxford Partners, Polyventures, Grumman Ventures, Eastman Kodak Ventures, Spectra Ventures, Investech, & Allied Signal Corporate
  • Delivered on-board units to Grumman, Lockheed, etc.


John Hiller, Oryx Corporation Founder, Supercomputer Architecture Cornerstone Patent


  •  Highly parallel computer architecture employing crossbar switch with selectable pipeline delay
  • United States Patent 5081575


  • A crossbar switch which connects N (N=2k ; k=0, 1, 2, 3) coarse grain processing elements (rated at 20 million floating point operations per second) to a plurality of memories provides for a parallel processing system free of memory conflicts over a wide range of arithmetic computations (i.e. scalar, vector and matrix). The configuration of the crossbar switch, i.e., the connection between each processing element unit and each parallel memory module, may be changed dynamically on a cycle-by-cycle basis in accordance with the requirements of the algorithm under execution. Although there are certain crossbar usage rules which must be obeyed, the data is mapped over parallel memory such that the processing element units can access and operate on input streams of data in a highly parallel fashion with an effective memory transfer rate and computational throughput power comparable in performance to present-day supercomputers. The crossbar switch is comprised of two basic sections; a multiplexer and a control section. The multiplexer provides the actual switching of signal paths, i.e. connects each processing element unit to a particular parallel memory on each clock cycle. The control section determines which connections are made on each clock cycle in accordance with the algorithm under execution. Selectable pipelined delay in the control section provides for optimal data transfer efficiency between the processors and memory modules over a wide range of array processing algorithms. The crossbar switch also provides for graceful system degradation in computational throughput power without the need to download a new program.
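
To make the idea concrete, here is a minimal Python sketch of the general technique (an illustration only, not the patented Oryx design): data is skewed across N parallel memory modules, and the crossbar configuration is changed every clock cycle so that each processing element always reads from a distinct module, i.e. access is conflict-free.

```python
# Minimal sketch of a cycle-by-cycle reconfigurable crossbar (illustration only,
# not the patented Oryx design). N processing elements read from N parallel
# memory modules; the connection pattern changes every clock so that no two PEs
# address the same module on the same cycle (conflict-free access).

N = 4  # processing elements / memory modules (N = 2**k)

# Skewed storage: matrix element a[i][j] lives in module (i + j) % N at address j.
# With this mapping both a full row and a full column fall in N distinct modules,
# so either can be fetched in a single conflict-free cycle.
module = [[None] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        module[(i + j) % N][j] = (i, j)

def crossbar_config(row):
    """Crossbar setting for reading row `row`: PE p is wired to module (row + p) % N."""
    return [(p, (row + p) % N) for p in range(N)]

for row in range(N):
    config = crossbar_config(row)              # new switch setting this cycle
    assert len({m for _, m in config}) == N    # every PE sees a distinct module
    fetched = [module[m][p] for p, m in config]
    print(f"cycle {row}: PE<->module {config} fetches row elements {fetched}")
```

The skewed data mapping and the per-cycle switch setting are the two ingredients that let all N processing elements stream data in parallel without memory conflicts, which is the property the patent describes for scalar, vector, and matrix computations.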



John Hiller, Oryx Corporation Founder, Supercomputer Software Patent


  • Method for Automated Deployment of a Software Program onto a Multi-Processor Architecture
  • United States Patent 5418953


  • The method performs pre-assignment and scheduling of tasks, enabling allocation across multiple physical processors arranged in a variety of architectures. The assignment attempts to arrive at a minimal cost value across all tasks comprising the problem.
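
Below is a minimal Python sketch of the general idea (a simple greedy, longest-task-first heuristic chosen for illustration; the patented method's actual cost model and assignment procedure are not reproduced here).

```python
# Minimal greedy sketch of cost-based task pre-assignment across processors
# (illustration of the general idea only, not the patented method). Each task
# is placed on the processor where it adds the least to the overall cost,
# approximating a minimal total-cost schedule.

def assign_tasks(task_costs, num_processors):
    """task_costs: list of per-task execution costs (e.g., estimated cycles).
    Returns (processor -> list of task indices, makespan)."""
    load = [0.0] * num_processors
    assignment = {p: [] for p in range(num_processors)}
    # Longest-processing-time-first: sort tasks by descending cost, then place
    # each task on the currently least-loaded processor.
    for task in sorted(range(len(task_costs)), key=lambda t: -task_costs[t]):
        p = min(range(num_processors), key=lambda q: load[q])
        assignment[p].append(task)
        load[p] += task_costs[task]
    return assignment, max(load)  # makespan = cost of the most loaded processor

# Hypothetical example: 8 tasks spread over 4 processing elements.
schedule, makespan = assign_tasks([7, 3, 5, 2, 6, 4, 1, 8], num_processors=4)
print(schedule, "makespan:", makespan)
```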


Past: Oryx High Performance Supercomputer System Capabilities Directly Applicable to AI Performance

  • Multiple processing elements (floating-point, fixed-point, and logical units) on computational nodes (2, 4, 8, etc.) connected to multiple memories (2, 4, 8, etc.) via a crossbar switch

  • Scalar, vector, and matrix computation capabilities

  • Fully connected parallel crossbar switch, free of memory-access conflicts over a wide range of applications

  • High effective memory transfer rates

  • Data mapped to parallel memories so it is readily accessible for input/output and matrix processing


Dr. Narendra Ahuja Artificial Intelligence Math & Computer Science Capabilities – Page 1

  • Alma mater: University of Maryland, College Park
  • Known for: Face detection, video understanding, motion planning, 3D vision, computational cameras, pattern recognition
  • Awards: IEEE Emanuel R. Piore Award (1999); IEEE Fellow (1992); ACM Fellow (1996); Presidential Young Investigator Award (1984); AAAI Fellow (1992); SPIE Technology Achievement Award (1998)
  • Fields: Computer science, computer vision, artificial intelligence, machine learning, robotics
  • Institutions: Professor, University of Illinois at Urbana-Champaign; Founding Director, International Institute of Information Technology, Hyderabad; Founding Director, Information Technology Research Academy, Delhi
  • Thesis: Mosaic Models for Image Analysis and Synthesis (1979)
  • Doctoral advisor: Azriel Rosenfeld

Dr. Narendra Ahuja Artificial Intelligence Math & Computer Science Capabilities – Page 2

  • Computer vision, advanced artificial intelligence, machine learning/pattern recognition, computer architecture, probability theory, robotics, virtual environments, digital signal processing and knowledge networks
  • Computational approach to automatically extract syntax of images & use it for automated image understanding
  • Special purpose cameras
  • Understanding 3D scenes from Stereo, Motion and Texture
  • Machine learning: Efficient, Explainable, Physics inspired, Multimodal
  • Algorithms and computational complexity
  • PhD in computer science from the University of Maryland, College Park, advised by Azriel Rosenfeld (father of image processing); thesis: Mosaic Models for Image Analysis and Synthesis
  • Publications: 3 Books Co-Authored, 20 Book Chapters, 115 Journal Articles, 350 Conference Papers, 4 Patents
  • Student Advisees: 60 PhDs, 30 MS
  • New labs developed: Computer Vision and Robotics Lab (research), Robotics (teaching)
  • New Courses Started: Computer Vision, Pattern Recognition, Robotics
  • Consulting: SAIC, Battelle Corp., AT&T, Lockheed, Eastman Kodak, Honeywell, Westinghouse, HRL Labs
  • Development: Automated rail inspection; Active and Passive Cameras - omnifocus, hemispherical, high-dynamic-range, stereo; Fingerprint recognition

Dr. Narendra Ahuja Artificial Intelligence Math & Computer Science Capabilities – Page 3

Patents

  • N. Ahuja and M. Tabb, Multiscale Image Edge and Region Detection Method and Apparatus, U.S. Patent, issued September 1998.
  • R. Dugad and N. Ahuja, Transform Domain Significant Coefficient Digital Image Watermarking Method, U.S. patent application filed June 2000.
  • H. Hua and N. Ahuja, Method and Apparatus for a High-Resolution and Real-Time Panoramic Camera, patent application filed November 2001.
  • A. Krishnan and N. Ahuja, Imaging Apparatus and Method for Determining Range from Focus and Focus Information, U.S. Patent, issued September 1995; European patent issued October 2001.
  • R. Dugad and N. Ahuja, Transformation of Image Parts in Different Domains to Obtain Resultant Image Size Different From Initial Image Size, U.S. Patent, issued October 2004.

Books Authored

  • N. Ahuja and B. Schachter, Pattern Models, Wiley, 1983.
  • J. Weng, T. S. Huang and N. Ahuja, Motion and Structure from Image Sequences, Springer-Verlag, 1992.
  • Ming-Hsuan Yang and N. Ahuja, Face Detection and Hand Gesture Recognition for Vision-Based Human Computer Interaction, Kluwer Academic Publishers, 2001.

Enabling Nvidia to Make Faster & More Accurate Stock Trades

Nvidia Streaming Multiprocessor GA100


  • 128 (2^7) SMs per GPU
  • 32 (2^5) CUDA cores per SM
  • 4,096 (2^12) CUDA cores total per GPU
  • Two FP32 multiply-or-add units per CUDA core; 64 FP32 operations per SM per clock
  • One FP64 multiply-or-add unit per CUDA core; 32 FP64 operations per SM per clock
  • 40 GB HBM2 total, 6 device memories per GPU
  • 40 MB L2 cache total, 6 L2s per GPU
  • 128 L1 data caches per GPU, 192 KB per L1
  • 8 KB of register file per CUDA core (4 register files)
  • 1.2 GHz clock
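
As a quick sanity check, the peak arithmetic rates implied by the figures above can be computed directly. The sketch below uses only the numbers listed here; shipping GA100-based products differ in enabled SM count and clock rates.

```python
# Back-of-the-envelope peak throughput implied by the figures listed above
# (a sketch using only those numbers, not a measured or official specification).

sms_per_gpu       = 128
cuda_cores_per_sm = 32
fp32_ops_per_core = 2          # two FP32 multiply-or-add units per CUDA core
fp64_ops_per_core = 1
clock_hz          = 1.2e9      # 1.2 GHz

cuda_cores = sms_per_gpu * cuda_cores_per_sm
peak_fp32  = cuda_cores * fp32_ops_per_core * clock_hz   # mult-or-add ops per second
peak_fp64  = cuda_cores * fp64_ops_per_core * clock_hz

print(f"CUDA cores per GPU : {cuda_cores}")
print(f"Peak FP32 ops/s    : {peak_fp32 / 1e12:.2f} T")
print(f"Peak FP64 ops/s    : {peak_fp64 / 1e12:.2f} T")
```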


256 x 256 Matrix Multiply on Nvidia GA100 GPU Efficiencies

  • Dense memories to L1 transfer: 27,306 / 41,130 = 66.4%
  • L1 to CUDA core (register files) transfer: 8,448 / 41,130 = 20.5%
  • CUDA core FP32 unit 0: 4,096 / 41,130 = 10.0%
  • CUDA core FP32 unit 1: 5,120 / 41,130 = 12.4%


256 x 256 Matrix Multiply on Nvidia GA100 GPU Host to Host Runtimes

  • Host Gen 4 PCIe to 6 device memories: 32 GB/s transfer of 786,432 bytes = 25 microseconds
  • 6 device memories to 6 L2s: 177 GB/s transfer of 786,432 bytes = 4.5 microseconds
  • 6 L2s to 128 L1s: 280 GB/s transfer of 786,432 bytes = 2.8 microseconds
  • L1 to 32 CUDA cores: 10.6 GB/s transfer of 137,216 bytes = 13.0 microseconds
  • CUDA core computation at the 1.2 GHz clock (0.833 ns per clock), 13,824 clocks = 10.7 microseconds
  • Total runtime = 56 microseconds plus software overhead (a sketch of this arithmetic follows)
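
The total is simply the sum of the per-stage transfer and compute times. The sketch below reproduces that arithmetic with the byte counts, bandwidths, and clock figures listed above; per-stage values are left unrounded, so individual lines differ slightly from the rounded figures in the text while the total still lands near the quoted ~56 microseconds.

```python
# Host-to-host runtime model for the 256 x 256 matrix multiply, reproducing the
# stage-by-stage arithmetic above (a simple sum of stages, ignoring overlap and
# software overhead; byte counts and bandwidths as listed in the text).

GB = 1e9
stages = [
    ("PCIe Gen4 host -> 6 device memories", 786_432, 32 * GB),
    ("device memories -> 6 L2s",            786_432, 177 * GB),
    ("6 L2s -> 128 L1s",                    786_432, 280 * GB),
    ("L1 -> 32 CUDA cores",                 137_216, 10.6 * GB),
]

total_us = 0.0
for name, nbytes, bandwidth in stages:
    t_us = nbytes / bandwidth * 1e6
    total_us += t_us
    print(f"{name:38s} {t_us:5.1f} us")

compute_clocks  = 13_824
clock_period_ns = 1 / 1.2          # 1.2 GHz -> ~0.833 ns per clock
compute_us = compute_clocks * clock_period_ns / 1e3
total_us  += compute_us
print(f"{'CUDA core compute':38s} {compute_us:5.1f} us")
print(f"total (excluding software overhead)    {total_us:5.1f} us")
```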

256 x 256 Matrix Multiply on Nvidia GA100 GPU Runtimes in Clock Cycles

  • 27,306 clocks of data movement into and out of the dense memories, L2s, and L1s: 5,461 clocks to move the A matrix, evenly distributed across the 6 L2s (each A row is sent to a pair of L1s, and 64 pairs of L1s hold the A matrix); 16,384 clocks to move one half of the B matrix into 2 L2s (one L2 broadcasts to 64 of the 128 L1s, the other L2 broadcasts to the remaining 64); 5,461 clocks to move the C matrix back to the dense memories via the L2s
  • 13,824 clocks of computation in 4 passes (1,024 C matrices per pass)
  • 41,130 clocks total
  • The 256 x 256 C matrix ends up in the 6 dense memories tied to the host bus (see the sketch below)
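
This clock budget also yields the efficiency percentages quoted earlier; the short Python sketch below restates the arithmetic.

```python
# Clock-cycle breakdown for the 256 x 256 matrix multiply and the efficiency
# ratios quoted above (a restatement of the listed arithmetic, not a measurement).

a_in, b_in, c_out = 5_461, 16_384, 5_461       # A in, half-of-B broadcasts, C out
memory_to_l1 = a_in + b_in + c_out             # 27,306 clocks of memory traffic
compute      = 13_824                          # 4 passes through the CUDA cores
total        = memory_to_l1 + compute          # 41,130 clocks

l1_to_cores = 8_448                            # L1 <-> register-file traffic
fp32_pipe   = 4_096                            # CUDA-core FP32 compute clocks

print(f"total clocks               : {total}")
print(f"dense memory -> L1         : {memory_to_l1 / total:.1%}")
print(f"L1 -> CUDA core registers  : {l1_to_cores / total:.1%}")
print(f"CUDA core FP32 utilization : {fp32_pipe / total:.1%}")
```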

2048 x 2048 2D Convolution on Nvidia GA100 GPU Efficiencies

  • Dense Memories to L1 transfer: 174,762/188,458 = 92.7%
  • L1 to CUDA core transfer: 2,528/ 188,458 = 1.3%
  • ALUs:   11,168/188,458 = 5.9%

2048 x 2048 2D Convolution on Nvidia GA100 GPU Host to Host Runtimes

  • Host Gen 4 PCIe to 6 device memories: 32 GB/s transfer of 8,388,608 bytes = 262 microseconds
  • 6 device memories to 6 L2s: 177 GB/s transfer of 8,388,608 bytes = 47 microseconds
  • 6 L2s to 128 L1s: 280 GB/s transfer of 8,388,608 bytes = 30 microseconds
  • L1 to 32 CUDA cores: 10.6 GB/s transfer of 65,536 bytes = 6.2 microseconds
  • CUDA core computation at the 1.2 GHz clock (0.833 ns per clock), 11,168 clocks = 9.3 microseconds
  • Total runtime = 354 microseconds plus software overhead (see the sketch below)
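
The same staged model gives the convolution total; a compact sketch of the arithmetic, using the byte counts and bandwidths listed above:

```python
# Stage-by-stage runtime model for the 2048 x 2048 2D convolution (a simple sum
# of the listed transfer and compute stages, ignoring overlap and software overhead).

GB = 1e9
transfer_stages = {
    "PCIe Gen4 host -> device memories": (8_388_608, 32 * GB),
    "device memories -> L2s":            (8_388_608, 177 * GB),
    "L2s -> L1s":                        (8_388_608, 280 * GB),
    "L1 -> CUDA cores":                  (65_536, 10.6 * GB),
}
compute_us = 11_168 / 1.2e9 * 1e6   # 11,168 clocks at 1.2 GHz

total_us = compute_us + sum(nbytes / bw * 1e6 for nbytes, bw in transfer_stages.values())
print(f"total: {total_us:.0f} us (dominated by the host PCIe transfer)")
```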


2048 x 2048 2D Convolution on Nvidia GA100 GPU Runtimes in Clock Cycles

  • 174,762 clocks for movement of the input and output matrices, evenly distributed across the 6 L2s: 128 (16 x 16) sub-matrices to/from each of the 128 L1s
  • 13,440 clocks of computation in 32 passes (128 (16 x 16) matrices per pass): 352 compute clocks per pass plus 68 clocks for data to/from the L1s and CUDA cores, 420 clocks total per pass
  • 188,202 clocks total
  • The 2048 x 2048 matrix ends up in the 6 dense memories tied to the host bus

John Hiller Sales & Technology Consultant

294B Shore Drive (1041), Montague, NJ 07827, USA

+1.8456725431

Copyright © 2025 John Hiller Sales & Technology Consultant - All Rights Reserved.
