# Two Electron Integrals calculation accelerated with Double Precision exp() Hardware Module

Maciej Wielgosz<sup>2</sup>, Marcin Pietroń<sup>2</sup>, Ernest Jamro<sup>1,2</sup>, Paweł Russek<sup>1,2</sup>, Kazimierz Wiatr<sup>1,2</sup>

<sup>1</sup>Institute of Electronics AGH, Kraków

<sup>2</sup>ACK "Cyfronet" AGH, Kraków

#### 1. Introduction

An FPGA implementation of a double-precision exponential function module is presented here. The module will be incorporated into the Gaussian system to accelerate the extremely time-consuming evaluation of the exponential function. The exp() function is accelerated on an SGI RASC board with two Virtex-4 LX200 FPGAs, and the exp() module alone occupies less than 3% of a Virtex-4 LX200. Arguments of exp() are fetched to the FPGAs, and results are sent back to the processors, over the NUMAlink system bus operating at 6.4 GB/s. The exponential module runs at a clock frequency of 200 MHz. The external memory interface limits throughput to two exp() operations per clock cycle per FPGA. The authors expect an overall end-to-end speedup of 4x compared with the sequential implementation of the algorithm executed on a single 2 GHz Intel Itanium2 processor.
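The throughput figures quoted above can be cross-checked with a short calculation. This is only a sketch: it assumes 8-byte (double precision) arguments and results, and it assumes that the 6.4 GB/s bus figure is the sum of both transfer directions, which the text does not state explicitly.

```python
# Back-of-the-envelope throughput check using the figures from the text.
CLOCK_HZ = 200e6          # exp() module clock frequency
EXP_PER_CYCLE = 2         # limit imposed by the external memory interface
BYTES_PER_DOUBLE = 8      # IEEE-754 double precision

exp_per_second = CLOCK_HZ * EXP_PER_CYCLE                   # per FPGA
input_gb_s = exp_per_second * BYTES_PER_DOUBLE / 1e9        # arguments in
output_gb_s = exp_per_second * BYTES_PER_DOUBLE / 1e9       # results out
total_gb_s = input_gb_s + output_gb_s

print(f"exp() per second per FPGA: {exp_per_second:.3g}")
print(f"input stream:  {input_gb_s:.1f} GB/s")
print(f"output stream: {output_gb_s:.1f} GB/s")
print(f"both directions: {total_gb_s:.1f} GB/s")
```

Under these assumptions, one FPGA producing two results per cycle needs 3.2 GB/s in each direction, so the data streams rather than the exp() logic set the upper bound on throughput.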



The proposed *exp()* module consists of the following sub-modules:

- exceptional state (*inf*, *NaN*) detection logic for input data; this unit also converts input data to the internal fixed-point format (barrel shifter)
- exponent evaluation module, which separates the fractional part from the integer part (the integer part corresponds to the exponent field of the result)
- sign migration from the fractional to the integer part
- LUTs which store fractional elementary values of *exp()*, polynomial approximation and multipliers
- conversion to the IEEE-754 standard
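The data path listed above can be illustrated with a small software model. This is a hedged sketch, not the authors' design: it uses floating-point arithmetic instead of the internal fixed-point format, and the table size (`LUT_BITS`), the polynomial degree and the function name `exp_model` are assumptions chosen for illustration. It relies on the identity exp(x) = 2^k * 2^f, where k and f are the integer and fractional parts of x * log2(e).

```python
import math

LUT_BITS = 8                                  # hypothetical table index width
LUT_SIZE = 1 << LUT_BITS
LUT = [2.0 ** (i / LUT_SIZE) for i in range(LUT_SIZE)]   # 2**(i/256)
LOG2E = 1.0 / math.log(2.0)
LN2 = math.log(2.0)

def exp_model(x: float) -> float:
    """Software model of a table-plus-polynomial exp() (not bit-accurate)."""
    # 1) Exceptional states on the input.
    if math.isnan(x):
        return math.nan
    if math.isinf(x):
        return math.inf if x > 0 else 0.0
    # 2) Split x*log2(e) into integer and fractional parts.
    t = x * LOG2E
    if t >= 1024.0:                           # result overflows a double
        return math.inf
    if t < -1075.0:                           # result underflows to zero
        return 0.0
    k = math.floor(t)                         # becomes the exponent field
    f = t - k                                 # floor() keeps f non-negative,
                                              # moving the sign into k
    # 3) Table lookup for the leading bits of f, polynomial for the rest.
    idx = min(int(f * LUT_SIZE), LUT_SIZE - 1)
    r = f - idx / LUT_SIZE                    # small residual
    z = r * LN2                               # 2**r == exp(z)
    poly = 1.0 + z * (1.0 + z * (0.5 + z * (1.0 / 6.0
                 + z * (1.0 / 24.0 + z / 120.0))))
    # 4) Reassemble the floating-point result: mantissa * 2**k.
    return math.ldexp(LUT[idx] * poly, k)

for value in (-20.5, -1.0, 0.0, 1.0, 10.0):
    print(value, exp_model(value), math.exp(value))
```

Using `floor` for the split is one way to read the "sign migration" step: for negative arguments the fractional part stays in [0, 1) and the sign is carried entirely by the integer part.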

#### 3. Profiling

There are multiple exponential functions in the source code, but only a few of them are heavily used in the most common chemical computation tasks. For example, when computing the benzene molecule, the exp() function is executed a few billion times. One of the exp() hot spots is the subroutine responsible for functional computation in solving the Hartree-Fock equation.

Well-known profilers (e.g. GNU gprof) were used as the first step in profiling the Gaussian application. The results were not satisfactory (e.g. the profilers were not compatible with the Gaussian binaries). Consequently, a new dedicated profiling tool was developed. This tool is able to:

- parametrise function monitoring (function name, location)
- generate call graphs
- estimate the time of function calls
- monitor data flow
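The capabilities listed above can be illustrated with a toy monitor. This is a purely hypothetical sketch and not the tool developed for Gaussian: the class `CallMonitor`, its method `watch` and the functions `my_exp` and `functional` are invented for illustration. It shows how selecting functions by name, counting caller-to-callee edges and accumulating call time fit together.

```python
import math
import time
from collections import defaultdict

class CallMonitor:
    """Toy profiler: call counts, cumulative time and call-graph edges."""

    def __init__(self, names):
        self.names = set(names)               # functions selected for monitoring
        self.calls = defaultdict(int)         # name -> number of calls
        self.seconds = defaultdict(float)     # name -> cumulative wall time
        self.edges = defaultdict(int)         # (caller, callee) -> count
        self._stack = []                      # currently active monitored calls

    def watch(self, func):
        name = func.__name__
        if name not in self.names:            # not selected: leave untouched
            return func

        def wrapper(*args, **kwargs):
            caller = self._stack[-1] if self._stack else "<root>"
            self.edges[(caller, name)] += 1
            self.calls[name] += 1
            self._stack.append(name)
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.seconds[name] += time.perf_counter() - start
                self._stack.pop()
        return wrapper

monitor = CallMonitor({"my_exp", "functional"})

@monitor.watch
def my_exp(x):
    return math.exp(x)

@monitor.watch
def functional(points):
    # Stand-in for a routine that evaluates exp() at many points.
    return sum(my_exp(-p * p) for p in points)

functional([0.1 * i for i in range(1000)])
for (caller, callee), count in monitor.edges.items():
    print(f"{caller} -> {callee}: {count} calls, "
          f"{monitor.seconds[callee]:.6f} s total in {callee}")
```

Note that the time recorded for a caller includes the time spent in its monitored callees.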

#### 4. Software – Hardware CoDesign

An automated tool will be developed and extended with additional options that enable investigation of the source code to find hot spots. The tool is intended to facilitate parameterization of the user environment. We also concentrate on automated extraction of inherent parallelism from the source code.



#### 5. Implementation results

### Implementation results for Virtex-4 LX200

| Implementation                           | 4-input LUTs  | Flip-flops    | 18-Kb BRAMs | DSP48    |
|------------------------------------------|---------------|---------------|-------------|----------|
| 1) With DSP48                            | 1293 (0.73%)  | 105 (0.06%)   | 6 (1.8%)    | 71 (74%) |
| 2) Without DSP48                         | 13375 (7.5%)  | 105 (0.06%)   | 6 (1.8%)    | 0        |
| 3) Optimized multipliers                 | 5025 (3%)     | 5223 (3%)     | 6 (1.8%)    | 0        |
| 4) RASC system                           | 14,521 (8%)   | 20,125 (11%)  | 29 (8%)     | 0        |
| 5) Single precision (Virtex-2 1000) [6]  | 1896 (1.044%) | 1896 (1.036%) | 0           | 0        |

The implementation results show that the multipliers are the most hardware-consuming part of the module and introduce the longest latency. This gave rise to the idea of designing dedicated, speed-optimized multipliers.

It is worth noting that resource usage differs considerably between the standard and optimized multiplier versions (13375 versus 5025 LUTs). The large difference in the number of flip-flops (105 versus 5223) results from the pipelining employed in the optimized multipliers.

