2 ReneGENE-AccuRA: A Multichannel Implementation of AccuRA SRM Pipeline


ReneGENE-GI: Empowering Precision Genomics with FPGAs on HPCs


memory bottleneck and storage issues, thus reducing the computing and I/O

burden on the host significantly. With an appropriate data streaming pipeline,

we provide an affordable solution, customizable according to scalability needs

and capital availability. It is also pluggable to any genome analysis pipeline for

use across multiple domains from research to clinical environment.



ReneGENE-AccuRA: Prototype and Results

Prototype Model for ReneGENE-AccuRA

ReneGENE-AccuRA was prototyped on an HPC platform equipped with a reconfigurable accelerator card built on multiple Xilinx Virtex-7 XC7V2000T devices, which is scalable up to 633 million ASIC gates. The host processor connects to the card through a Kintex-7 XC7K325T-FBG900 FPGA over a high-speed PCIe Gen3 x8 interface. The embarrassingly parallel bio-computing in AccuRA’s SRM is further favoured by the inherent reprogrammability of FPGAs, their massively parallel compute resources, extreme data path parallelism and the fine-grained control mechanisms they offer.


ReneGENE-AccuRA Software

The software stack comprises the preprocessing and post-processing modules of the ReneGENE-GI pipeline. This includes: (i) the reference index hashing step based on the MMPH algorithm, (ii) the read-lookup algorithm, which searches the indexed reference for candidate genomic locations of a probable alignment, (iii) the HPC platform-specific libraries and middleware, (iv) the hardware abstraction layer with the corresponding device drivers and platform drivers, (v) the post-processing module that decides on the best alignment and secondary alignments and computes the corresponding alignment/map qualities, and (vi) the subsequent formatting of the output data in the Sequence Alignment/Map (SAM) format. The pipeline also allows conversion of the SAM file to its compressed Binary Alignment/Map (BAM) equivalent and verification of its fitness for downstream NGS data analytics. The software runs on a multi-core host (Intel Core i7 8-core processor) with 32 GB of system RAM.


ReneGENE-AccuRA Hardware

The multi-channel ReneGENE-AccuRA is represented as the DUT within each FPGA. It is interfaced with the prototyping infrastructure on the FPGA through the standard AXI4 interface with a 256-bit-wide data bus, running at a frequency of 125 MHz. The Address Remapper unit automatically remaps the address spaces of the DUT for transactions, which makes it easy to scale the DUT by adding more AccuRA SRM channels. The implementation is done in VHDL and Verilog.
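As a quick sanity check on the interface figures above, the theoretical peak throughput of a 256-bit bus clocked at 125 MHz can be computed directly. The sketch below assumes one data beat per clock cycle and ignores AXI4 protocol overhead, neither of which the text specifies; the function and constant names are illustrative only.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: int) -> int:
    """Upper bound on throughput: one full-width data beat every clock cycle."""
    return (bus_width_bits // 8) * clock_hz


AXI_WIDTH_BITS = 256          # data bus width quoted in the text
AXI_CLOCK_HZ = 125_000_000    # 125 MHz, as quoted in the text

peak = peak_bandwidth_bytes_per_s(AXI_WIDTH_BITS, AXI_CLOCK_HZ)
print(f"Peak AXI4 bandwidth per interface: {peak / 1e9:.1f} GB/s")
```

Under these assumptions each 256-bit interface tops out at 32 bytes per cycle; sustained throughput on real hardware would be lower.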


S. Natarajan et al.


Scalability Analysis for ReneGENE-AccuRA

The parameters used in the scalability analysis of ReneGENE-AccuRA are given in Table 1. Consider the multi-channel AccuRA SRM pipeline model, where the reads are streamed in at the rate $R_{in}$ (measured in Giga Reads/second, or GR/s) over an input streaming bandwidth of $BW_{in}$ (measured in Giga Bytes/second, or GB/s). The $m$ subsequences of short reads, each of length $l$, are streamed through a streaming buffer of depth $B$, which holds one subsequence in each word of storage.

Table 1. Scalability analysis parameters

| Parameter | Symbol / definition |
|---|---|
| Streaming input bandwidth | $BW_{in}$ |
| Short read length | |
| Streaming buffer depth | $B$ |
| Subsequence length for short read | $l$ |
| Number of partitions for input short read, with overlapping | $m$ |
| Number of bits for encoding each bp of input short read | |
| Streaming buffer width | |
| MAK unit clock period | $\tau_{MAK}$ |
| DPK unit clock period | $\tau_{DPK}$ |
| MAK unit operating cycles | $x$ |
| DPK unit operating cycles | $y$ |
| Total number of pairs launched on MAK-DPK units for SRM | $P$ |
| Number of MAK-DPK units deployed on a single AccuRA SRM pipeline channel | $N$ |
| Number of pairs allotted for SRM, per MAK-DPK unit | $p$ |
| No. of cell updates per DPK unit | $C$ |
| No. of filter kernel operations per MAK unit | $K$ |
| Rate at which reads are streamed in | $R_{in}$ |
| Total MAK unit time to cover $P$ pairs | $T_{MAK} = x \times \tau_{MAK}$ |
| Total DPK unit time to cover $P$ pairs | $T_{DPK} = y \times \tau_{DPK}$ |
| Read Processing Rate of a single AccuRA SRM pipeline channel | $R_{RHP}$ |
| Alignment Time of a single AccuRA SRM pipeline channel | $T_{RHP} = p \times (T_{MAK} + T_{DPK})$ |
| The total time invested in performing SRM in a single AccuRA SRM pipeline channel | $T_{\text{single channel AccuRA}} \approx T_{Load} + T_{RHP} + T_{Unload}$ |
| Performance of MAK units within a single AccuRA SRM pipeline, measured in terms of Giga Maps Per Second (GMPS) | $P_{MAK} = \frac{N \times K}{x \times T_{MAK}}$ |
| Performance of DPK units within a single AccuRA SRM pipeline, measured in terms of Giga Cell Updates Per Second (GCUPS) | $P_{DPK} = \frac{N \times C}{y \times T_{DPK}}$ |

Each MAK unit performs filtering in time $T_{MAK}$, over $x$ cycles of the MAK unit clock, with period $\tau_{MAK}$. Each DPK unit performs alignment in time $T_{DPK}$, over $y$ cycles of the DPK unit clock, with period $\tau_{DPK}$. If $N$ MAK-DPK units are configured within a single AccuRA SRM pipeline channel, then each unit gets its share of $p$ pairs for performing SRM. The single AccuRA SRM pipeline channel thus performs $N$ SRMs in a total time of $T_{MAK} + T_{DPK}$, with $N$ MAK-DPK units running in parallel. The single channel hence processes reads at a rate of $R_{RHP}$, measured in GR/s. At this rate, the hardware aligns all the $P$ reads, with the $N$ MAK-DPK units each working through their $p$ reads in parallel, over a total time of $T_{RHP}$.

For scaling up the performance, let us include C such channels of AccuRA

SRM pipelines within a single FPGA. Here, each channel will take the same

amount of time to process the same number of reads.

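Now, the overall performance of all the MAK units across the $C$ channels, measured in terms of Giga Maps Per Second (GMPS), is given by:

$$\frac{C \times N \times K}{x \times T_{MAK}}$$

The overall performance of the DPK units across the $C$ channels, measured in terms of Giga Cell Updates Per Second (GCUPS), is given by:

$$\frac{C \times N \times C}{y \times T_{DPK}}$$

Note that $C$ appears twice in the second numerator: once as the number of channels and once as the number of cell updates per DPK unit listed in Table 1.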


Thus, scaling up a single AccuRA SRM pipeline channel by increasing $N$ gives the ReneGENE-AccuRA hardware a better throughput, as it can handle more pairs in parallel. This is complemented by scaling up the number of such channels, $C$, within a single FPGA. The number of channels within an FPGA is limited only by the reconfigurable hardware space available to the DUT. The input data is then divided evenly among the channels, so that the SRM process completes in approximately $1/C$ of the total time taken for SRM by a single channel.
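The timing side of this model can be sketched in a few lines. The code below follows the Table 1 relations for $T_{MAK}$, $T_{DPK}$, $T_{RHP}$ and the single-channel total, and models $C$ channels by giving each channel an even share of the pairs. Two things are assumptions rather than statements from the text: that $p$ is obtained by splitting $P$ evenly over the $N$ units, and all numeric values in the example except $N = 16$ (cycle counts, clock periods, pair count) are hypothetical.

```python
import math


def single_channel_time(P, N, x, tau_mak, y, tau_dpk, t_load=0.0, t_unload=0.0):
    """Approximate T_Load + T_RHP + T_Unload for one AccuRA SRM channel."""
    p = -(-P // N)                 # assumed: pairs split evenly, rounded up
    t_mak = x * tau_mak            # T_MAK = x * tau_MAK
    t_dpk = y * tau_dpk            # T_DPK = y * tau_DPK
    t_rhp = p * (t_mak + t_dpk)    # T_RHP = p * (T_MAK + T_DPK)
    return t_load + t_rhp + t_unload


def multi_channel_time(P, C, **channel_params):
    """Time with C identical channels, each handling an even share of P."""
    return single_channel_time(-(-P // C), **channel_params)


# Hypothetical parameters: only N = 16 units per channel comes from the text.
params = dict(N=16, x=100, tau_mak=8e-9, y=400, tau_dpk=8e-9)
t_one = single_channel_time(1_600_000, **params)
t_two = multi_channel_time(1_600_000, C=2, **params)
print(f"single channel: {t_one:.3f} s, dual channel: {t_two:.3f} s")
```

With load and unload times set to zero, the dual-channel figure is half the single-channel one, which is the $1/C$ behaviour described above; non-zero load and unload times would make the gain slightly smaller.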


Results from Large Genome Benchmarks for ReneGENE-AccuRA

The ReneGENE-AccuRA prototype was tested by running SRM on very large data sets, of the order of several gigabytes, for the human genome. The details of the input data set are provided in Table 2. We have used the GRCh38 reference genome assembly, which is around 3 billion bases long, consisting of all the 23 chromosomes and the mitochondrial DNA. We have considered the alignment of three human genomes belonging to one family: the father (SRR1559289, SRR1559290, SRR1559291, SRR1559292, SRR1559293), the mother (SRR1559294, SRR1559295, SRR1559296, SRR1559297, SRR1559298) and their child (SRR1559281, SRR1559282, SRR1559283, SRR1559284). Here, each read is 200 bp long. The reads are subjected to lookup against the reference genome index. Subsequently, they are sent for alignment on the FPGA by streaming over the PCIe link through buffers that are configured to hold up to 18874368 words of data in one batch.

Table 2. Human genome experiment details

| ID | SRR read | No. of reads | No. of bases | Buffer contents for SRM | No. of streamed |
|---|---|---|---|---|---|
| 1 | | 27594045 | 5.5G | | |
| 2 | | 28019239 | 5.6G | | |
| 3 | | 169777482 | 34G | | |
| 4 | | 168278483 | 33.7G | | |
| 5 | | 168484341 | 33.7G | | |
| 6 | | 180827103 | 36.2G | | |
| 7 | | 96741850 | 19.3G | | |
| 8 | | 148849161 | 29.8G | | |
| 9 | | 33028205 | 6.6G | | |
| 10 | SRR1559298 | 33621893 | 6.7G | | |
| 11 | SRR1559281 | 146929886 | 29.4G | | |
| 12 | SRR1559282 | 143848074 | 28.8G | | |
| 13 | SRR1559283 | 144871968 | 29G | | |
| 14 | SRR1559284 | 142831237 | 28.6G | | |
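The base counts in Table 2 follow from the read counts and the stated read length of 200 bp. A minimal check, using only the figures printed in the table (the variable and function names are illustrative):

```python
READ_LENGTH_BP = 200  # read length stated in the text

# (number of reads, reported bases in units of 1e9) as printed in Table 2
TABLE_2 = [
    (27594045, 5.5), (28019239, 5.6), (169777482, 34.0), (168278483, 33.7),
    (168484341, 33.7), (180827103, 36.2), (96741850, 19.3), (148849161, 29.8),
    (33028205, 6.6), (33621893, 6.7), (146929886, 29.4), (143848074, 28.8),
    (144871968, 29.0), (142831237, 28.6),
]


def giga_bases(n_reads: int) -> float:
    """Total bases for n_reads reads of READ_LENGTH_BP each, in units of 1e9."""
    return n_reads * READ_LENGTH_BP / 1e9


for n_reads, reported in TABLE_2:
    # Each reported value should agree to within rounding at one decimal place.
    assert abs(giga_bases(n_reads) - reported) <= 0.05

# The batch buffer capacity quoted in the text is exactly 18 * 2**20 words.
BUFFER_WORDS = 18_874_368
assert BUFFER_WORDS == 18 * 2**20
```

Every row agrees with its reported total to within one-decimal rounding, and the buffer capacity works out to a round binary figure.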

The ReneGENE-AccuRA prototype was tested with single- and dual-channel AccuRA SRM pipelines within a single FPGA while aligning the human short read sets. Each channel hosted 16 MAK units and 16 DPK units. With this configuration, to align 500 million reads (100 bases long) against the reference genome (3 billion bases long), with each read reporting a mapping at five locations on the reference, ReneGENE-AccuRA performs 4.65 Tera map operations and 10.24 Tera cell updates at rates of 21.14 GMPS and 46.56 GCUPS, in about 3.68 min. The implementation results for the dual-channel ReneGENE-AccuRA are provided in Table 3.
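The operation totals and rates quoted above can be cross-checked against the reported run time by dividing one by the other. The sketch below uses only the numbers from the text; the function name is illustrative.

```python
MAP_OPERATIONS = 4.65e12     # Tera map operations, from the text
CELL_UPDATES = 10.24e12      # Tera cell updates, from the text
MAPS_PER_SECOND = 21.14e9    # 21.14 GMPS
CELLS_PER_SECOND = 46.56e9   # 46.56 GCUPS


def minutes_needed(total_operations: float, operations_per_second: float) -> float:
    """Run time implied by a workload size and a sustained rate."""
    return total_operations / operations_per_second / 60.0


mak_minutes = minutes_needed(MAP_OPERATIONS, MAPS_PER_SECOND)
dpk_minutes = minutes_needed(CELL_UPDATES, CELLS_PER_SECOND)
print(f"MAK: {mak_minutes:.2f} min, DPK: {dpk_minutes:.2f} min")
```

Both ratios come out at roughly 220 s, which is consistent with the reported figure of about 3.68 min once rounding of the quoted rates is taken into account.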

Table 3. ReneGENE-AccuRA utilization report, with single and dual channel AccuRA SRM pipelines on a single Xilinx Virtex 7 XC7V2000T device

| Resource | Single channel | Utilization (%): single channel | Dual channel | Utilization (%): dual channel |
|---|---|---|---|---|
| Number of slice registers | 128708 out of … | | 213366 out of … | |
| Number of slice LUTs | 170177 out of … | | 310045 out of … | |
| Number of bonded IOBs | 166 out of 850 | | 96 out of 850 | |
| Number of block RAM/FIFO | 203 out of 1292 | 15.71 | 374 out of 1292 | 28.90 |



Fig. 4. Performance comparison of FPGA versus GPU for human short read sets.

For the human genome read sets in Table 2, the alignment times for various configurations are shown in Fig. 4. The time taken by ReneGENE-AccuRA is about one-fifth (with a single-channel AccuRA SRM pipeline) and about one-tenth (with a dual-channel AccuRA SRM pipeline) of the time taken by the single-GPU OpenCL implementation of ReneGENE-GI’s CGM. This single-GPU implementation is itself 2.62x faster than CUSHAW2-GPU (the GPU CUDA implementation of CUSHAW) [21,22]. With the single-GPU implementation demonstrating a speedup of 150x over standard heuristic aligners in the market such as BFAST [23], the reconfigurable accelerator version of ReneGENE-AccuRA is several orders of magnitude faster than the competitors, offering precision over heuristics. Extending the implementation to four and six channels within a single FPGA is expected to increase performance further, as the scalability analysis indicates. With multiple FPGAs available on the platform, the scope for improvement grows with the number of FPGAs and the number of channels supported within each FPGA.
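To make the "several orders" claim concrete, the quoted ratios can be chained. This is only a rough sketch: it assumes the FPGA-versus-GPU and GPU-versus-BFAST ratios were measured on comparable workloads and therefore multiply, which the text does not state explicitly. The names below are illustrative.

```python
GPU_SPEEDUP_OVER_BFAST = 150.0   # single-GPU implementation vs BFAST, from the text

# Time ratios from the text: FPGA takes about 1/5 (single) and 1/10 (dual) of GPU time.
FPGA_SPEEDUP_OVER_GPU = {"single channel": 5.0, "dual channel": 10.0}


def implied_speedup_over_bfast(fpga_over_gpu: float) -> float:
    """Chained speedup, assuming the two ratios compose multiplicatively."""
    return GPU_SPEEDUP_OVER_BFAST * fpga_over_gpu


for config, ratio in FPGA_SPEEDUP_OVER_GPU.items():
    print(f"{config}: about {implied_speedup_over_bfast(ratio):.0f}x over BFAST")
```

Under that assumption the dual-channel configuration lands at roughly three orders of magnitude over the heuristic aligner.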



Through this paper, we have presented ReneGENE-GI, an innovatively engineered GI pipeline. The pipeline strikes the right balance between comparative genomics and de novo read extension, to run an irregular application like GI. With parallel algorithms executed on reconfigurable accelerator hardware, ReneGENE-GI exploits the inherent parallelism and scalability of the hardware at the level of micro and system architecture, amidst fine-grain synchronization. Supplemented with a multi-threaded firmware architecture, the Comparative Genomics Module (CGM) in ReneGENE-GI precisely aligns short reads at a fine-grained single-nucleotide resolution and offers full alignment coverage of the genome, including repeat regions. The parallel dynamic programming kernels on multiple channels of CGM seamlessly perform the traceback in hardware simultaneously with the forward scan, thus achieving short read mapping in minimum deterministic time. ReneGENE-GI is a fully streaming solution that



eliminates memory bottleneck and storage issues, thus reducing the computing

and I/O burden on the host significantly. With an appropriate data streaming

pipeline, we provide an affordable solution, customizable according to scalability

needs and capital availability. It is also pluggable to any genome analysis pipeline

for use across multiple domains from research to clinical environment.


References

1. Frese, K.S., Katus, H.A., Meder, B.: Next-generation sequencing: from understanding biology to personalized medicine. Biology 2(4), 378–398 (2013)

2. Mardis, E.R.: A decade’s perspective on DNA sequencing technology. Nat. Perspect. 470, 198–203 (2011)

3. Stephens, Z.D., Lee, S.Y., Faghri, F., Campbell, R.H., Zhai, C., Efron, M.J., et al.:

Big data: astronomical or genomical? PLOS Biol. 13(7), e1002195 (2015)

4. Lee, C.Y., Chiu, Y.C., Wang, L.B., et al.: Common applications of next-generation

sequencing technologies in genomic research. Transl. Cancer Res. 2(1), 33–45


5. Alyass, A., Turcotte, M., Meyre, D.: From big data analysis to personalized

medicine for all: challenges and opportunities. BMC Med. Genom. 8, 33 (2015)

6. Costa, F.F.: Big data in genomics: challenges and solutions. G.I.T. Lab. J. 11(12),

2–4 (2012)

7. Baker, M.: Next-generation sequencing: adjusting to data overload. Nat. Methods

7, 495–499 (2010)

8. Chen, C., Schmidt, B.: Performance analysis of computational biology applications

on hierarchical grid systems. In: Proceedings of IEEE International Symposium on

Cluster Computing and the Grid, CCGrid 2004, Chicago, IL, pp. 426–433 (2004)

9. Bader, D.A.: High-performance algorithm engineering for large-scale graph problems and computational biology. In: Nikoletseas, S.E. (ed.) WEA 2005. LNCS, vol. 3503, pp. 16–21. Springer, Heidelberg (2005). https://doi.org/10.1007/11427186_3

10. SERC: Indian Institute of Science, Bangalore. Sahasrat (Cray XC40). http://www.


11. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv.

33(1), 31–88 (2001)

12. Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences.

J. Mol. Biol. 147, 195–197 (1981)

13. Altschul, S.F., Bundschuh, R., Olsen, R., Hwa, T.: The estimation of statistical

parameters for local alignment score distributions. Nucl. Acids Res. 29, 351–361


14. Myers, E.: A sublinear algorithm for approximate keyword searching. Algorithmica

12, 345–374 (1994)

15. Treangen, T.J., Salzberg, S.L.: Repetitive DNA and next-generation sequencing:

computational challenges and solutions. Nat. Rev. 13, 36–46 (2012)

16. Flicek, P., Birney, E.: Sense from sequence reads: methods for alignment and assembly. Nat. Methods 6, S6–S12 (2009)

17. Li, H., Homer, N.: A survey of sequence alignment algorithms for next-generation

sequencing. Briefings Bioinform. 2, 473–483 (2010)

18. Hatem, A., Bozdag, D., Toland, A.E., Catalyurek, U.V.: Benchmarking short

sequence mapping tools. BMC Bioinform. 14, 184 (2013)



19. Natarajan, S., KrishnaKumar, N., Pal, D., Nandy, S.K.: AccuRA: accurate alignment of short reads on scalable reconfigurable accelerators. In: Proceedings of IEEE

International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XVI), pp. 79–87, July 2016

20. Natarajan, S., KrishnaKumar, N., Pavan, M., Pal, D., Nandy, S.K.: ReneGENE-DP: accelerated parallel dynamic programming for genome informatics. In: Accepted at the 2018 International Conference on Electronics, Computing and Communication Technologies (IEEE CONECCT), March 2018

21. Liu, Y., Schmidt, B., Maskell, D.L.: CUSHAW: a CUDA compatible short read

aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics

28(14), 1830–1837 (2012)

22. Liu, Y., Schmidt, B.: CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing. IEEE Des. Test Comput. 31(1), 31–39 (2014)

23. Homer, N., Merriman, B., Nelson, S.F.: BFAST: an alignment tool for large scale

genome resequencing. PLoS ONE 4, e7767 (2009)

FPGA-Based Parallel Pattern Matching

Masahiro Fukuda(1,2) and Yasushi Inoguchi(1)

(1) Japan Advanced Institute of Science and Technology, Ishikawa, Japan
(2) National Institute of Technology, Ishikawa College, Ishikawa, Japan

Abstract. To protect IoT (Internet of Things) nodes against cyber attacks, NIDS (Network-based Intrusion Detection Systems) are becoming important. On future high-speed networks, a NIDS needs to be implemented in high-speed hardware, and parallelization is inevitable. PCRE (Perl Compatible Regular Expressions) pattern matching is one of the most complex parts of a NIDS. We tried to improve the parallelization of PCRE pattern matching in Snort by implementing it on an FPGA. The essence of our method is eliminating memory, which is the bottleneck of parallelization, from STEs (State Transition Elements). Our evaluation shows that the proposed method is 8.37 times faster than a previous method.

Keywords: Field-Programmable Gate Array · Perl Compatible Regular Expressions · State Transition Element



Introduction

Recently, the IoT (Internet of Things) has become prevalent. Not only personal computers and smartphones but also household appliances and office equipment, such as cameras, printers and digital video recorders, are connected to the Internet. Currently, these embedded devices are often vulnerable, mainly because their initial passwords are left unchanged. However, there is another problem looming in the near future: they generally do not have the computing power to run antivirus or anomaly analysis software to deal with cyber attacks.

Therefore, we think that NIDS (Network-based Intrusion Detection Systems) or NIPS (Network-based Intrusion Prevention Systems) will become important for such systems. A NIDS is placed on a computer network and monitors packets to detect intrusions; it detects attacks on the network rather than on the host. Its main difficulty is executing complicated PCRE (Perl Compatible Regular Expressions) pattern matching at high speed. For example, Snort [1] is a well-known open-source NIDS/NIPS, now developed by Sourcefire Inc., which is owned by Cisco Systems. It can analyze traffic on IP (Internet Protocol) networks and contains thousands of PCREs, such as /buy\x2f\?code\=\d/ and /or[\s\x2f\x2A]+1 = 1/.

© Springer International Publishing AG, part of Springer Nature 2018
N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 192–203, 2018.

Network speeds are increasing from 1 Gbps to 400 Gbps. Table 1 shows examples of the best-known MAC (Media Access Control) sublayer interfaces standardized by IEEE [2]. Parallelizing pattern matching is clearly unavoidable if multiple 8-bit input symbols are to be processed at once without raising the clock frequency beyond its current level.

Table 1. Examples of MAC sublayer interfaces

Speed [Gbps] Clock freq. [MHz] Clock period [ns] Bus width [bits]





















Our research purpose is to accelerate PCRE pattern matching in NIDS without interfering with high-speed networks. A software implementation is not suitable for 100 Gbps, because memory and input/output delays are generally longer than 0.64 ns and fine-grained control of on-chip memory is difficult. Hence our choice is a hardware implementation, specifically an FPGA. We also aimed for high parallelization instead of increasing the clock frequency.
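The two figures quoted above, 100 Gbps and 0.64 ns, together fix how much data has to be consumed per clock cycle. A small sketch of that arithmetic, using integer units to avoid floating-point noise (the function names are illustrative, not from the paper):

```python
def bits_per_cycle(line_rate_gbps: int, clock_period_ps: int) -> int:
    """Bits arriving during one clock period at the given line rate.

    Gbit/s times picoseconds gives units of 1e-3 bits, hence the division by 1000.
    """
    return line_rate_gbps * clock_period_ps // 1000


def symbols_per_cycle(line_rate_gbps: int, clock_period_ps: int) -> int:
    """Number of 8-bit input symbols that must be handled each cycle."""
    return bits_per_cycle(line_rate_gbps, clock_period_ps) // 8


LINE_RATE_GBPS = 100     # from the text
CLOCK_PERIOD_PS = 640    # 0.64 ns, from the text

print(bits_per_cycle(LINE_RATE_GBPS, CLOCK_PERIOD_PS), "bits per cycle")
print(symbols_per_cycle(LINE_RATE_GBPS, CLOCK_PERIOD_PS), "symbols per cycle")
print(1_000_000 / CLOCK_PERIOD_PS, "MHz clock")
```

In other words, a matcher that keeps up with 100 Gbps at this clock period has to process several input symbols in parallel every cycle, which is the motivation for the parallel design.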

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes previous research. Section 4 introduces our memory-free method and explains how to parallelize the circuit. Section 5 evaluates our method. Section 6 concludes the work.



PCRE and Related Works


A PCRE (Perl Compatible Regular Expression) is a regular expression written in the syntax of the Perl language. It is used in Snort to define and detect strings that appear in complicated cyber attacks.

For example, /or[\s\x2f\x2A]+1 = 1/ is a PCRE. [\s\x2f\x2A] is called a character class, and it matches \s, \x2f or \x2A. \s means a white space character, including SP (SPace), LF (Line Feed), HT (Horizontal Tab) and so on. \x2f and \x2A mean slash and asterisk in the ASCII character table, respectively. The + after the class is a quantifier that requires one or more repetitions of it. The other characters (o, r, 1, = and 1) are single characters rather than character classes. Altogether, /or[\s\x2f\x2A]+1 = 1/ matches “or 1 = 1”, “or[HT]1 = 1” and so on, where [HT] means a horizontal tab.
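The behaviour of this example pattern can be tried out directly. The sketch below uses Python's built-in `re` module rather than the PCRE library that Snort uses; for this particular pattern the two syntaxes coincide, but that substitution is an assumption of the example. The helper name `matches` is illustrative.

```python
import re

# The PCRE from the text, with the delimiting slashes dropped.
PATTERN = re.compile(r"or[\s\x2f\x2A]+1 = 1")


def matches(payload: str) -> bool:
    """True if the pattern occurs anywhere in the payload."""
    return PATTERN.search(payload) is not None


print(matches("or 1 = 1"))      # space between "or" and "1"
print(matches("or\t1 = 1"))     # horizontal tab instead of a space
print(matches("or/**/1 = 1"))   # slashes and asterisks also belong to the class
print(matches("or1 = 1"))       # nothing from the class: the + needs at least one
```

The third input shows why the class includes slash and asterisk: an attacker can replace white space with an SQL-style comment, and the pattern still has to catch it.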

There are several processing steps in Snort, such as detecting HTTP requests and responses and extracting cookies, but processing PCREs is the most difficult part because of the growth of the dataset [3].



M. Fukuda and Y. Inoguchi

Software-Based Approaches

Fast software implementations tend to be DFA (Deterministic Finite Automaton) based and need large-capacity memory, because NFA (Non-deterministic Finite Automaton) based methods have to read the state transition table from memory many times. Pu et al. developed a software tool to translate regular expressions and reduced the execution time by 66%, but it is still 10.7 ms [4]. Another study, by Yi et al., presents the SFA (Semi-deterministic Finite Automaton), which lies between the DFA and the NFA in terms of computational and memory complexity, but it did not reach even 1 Gbps throughput [5]. Such software implementations rely on a large off-chip memory, which cannot provide single-clock-cycle access.

There are also several studies on GPU-based approaches. These can be faster than CPUs and are still implemented in software, but their power consumption is high. For example, Zu et al. achieved 10 Gbps and more with an NVIDIA GTX-460 [6], but such a GPU consumes at least 100 W. For embedding into network equipment, typical GPUs are not the best choice.


Hardware-Based Approaches

FPGAs (Field-Programmable Gate Arrays) are used in or near network equipment for various applications [7], and many efficient implementations of regular expression matching on FPGAs have been proposed in recent years [3]. Furthermore, in terms of power consumption, FPGAs draw a few watts to tens of watts, even for relatively high-end devices such as the Virtex-7.

As one of the most recent studies, Dlugosch et al. presented a semiconductor implementation of the AP (Automata Processor), which supports many functions of PCRE [8]. That work aimed at general-purpose pattern matching and used much memory.

Roy’s research is a case where Dlugosch’s AP was applied to Snort, among others [9]. It reports an estimate that 10.3 Gbps can be realized, but it is the ideal performance when 48 chips are used and communication breakdown is a certain

In another study, Cronin et al. proposed an efficient FPGA implementation for the quantifications of PCRE [10]. However, this method needs BRAM (Block RAM) for quantifications, which can again become a bottleneck for parallelism.

In this paper, we compare our method with Dlugosch’s research as the previous study. Our method eliminates memory, which was the bottleneck preventing high parallelization.




Full-STE’s Architecture

The Automata Processor (AP) [8] presented by Dlugosch can perform pattern matching once a description in PCRE or ANML (Automata Network Markup Language), both of which are languages capable of describing NFAs, is compiled and loaded. ANML is beyond the scope of this paper, but in both cases the STE (State Transition Element) is one of the basic elements of the AP, and one STE corresponds to one state transition. For each input symbol, each STE determines whether or not its state transition takes place.

Figure 1 shows Full-STEs, which are simplified from Dlugosch’s by omitting the OR gates. A Full-STE is composed of a D-FF, an AND gate and a 256-bit memory. More specifically, when an input symbol s arrives, the Decoder recognizes s and asserts only the corresponding RE(s) (Recognition Enable) line. The memory reads out the bit Mb(s) corresponding to RE(s). This Mb(s) represents whether or not the condition of the state transition is satisfied. The D-FF at the bottom of Fig. 1, on the other hand, represents whether or not the source state is active. The D-FF is always 1 if the source is the initial state. If the source state is currently active and the condition of the state transition is satisfied, then the destination state becomes active. The Match signal is the output signal and indicates that a match of at least one of the PCREs has been detected.

Fig. 1. Full-STE’s architecture
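The Full-STE behaviour described above can be mimicked in software. The following is a behavioural sketch only, not the circuit from the paper: each STE holds a 256-entry match memory and a D-FF, the AND of the two decides whether the transition fires, and a dictionary stands in for the Routing Matrix. The class, function and variable names, the choice of `or[\s\x2f\x2A]+1` as the example (a prefix of the PCRE used earlier), and the exact byte set taken for `\s` are all assumptions of the example.

```python
from typing import Dict, List


class FullSTE:
    """One state transition: a 256-bit match memory plus a D flip-flop."""

    def __init__(self, accepted: bytes, is_initial: bool = False):
        self.memory = [False] * 256        # Mb(s) for every 8-bit symbol s
        for symbol in accepted:
            self.memory[symbol] = True
        self.is_initial = is_initial
        self.dff = is_initial              # "source state is active" flag


def step(stes: List[FullSTE], routing: Dict[int, List[int]], symbol: int) -> List[bool]:
    """Process one input symbol and return which STEs fired."""
    # AND gate: source state active AND transition condition satisfied.
    fired = [ste.dff and ste.memory[symbol] for ste in stes]
    # Routing: a D-FF is set when any STE routed to it fired; initial stays 1.
    for dest, ste in enumerate(stes):
        ste.dff = ste.is_initial or any(fired[src] for src in routing.get(dest, ()))
    return fired


def scan(data: bytes) -> bool:
    """Match or[\\s\\x2f\\x2A]+1 anywhere in data using four Full-STEs."""
    whitespace = b" \t\n\r\x0b\x0c"        # assumed byte set for \s
    stes = [
        FullSTE(b"o", is_initial=True),
        FullSTE(b"r"),
        FullSTE(whitespace + b"/*"),
        FullSTE(b"1"),
    ]
    # dest -> sources; STE 2 feeds itself to implement the + quantifier.
    routing = {1: [0], 2: [1, 2], 3: [2]}
    for symbol in data:
        if step(stes, routing, symbol)[3]:  # the last STE firing is the Match signal
            return True
    return False


print(scan(b"xor /*1"), scan(b"or1"))
```

Each STE reads its own memory once per symbol, so the memories dominate the resource cost as the number of STEs grows.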

Let y be the number of Full-STEs, which depends on the size of the rule set. They are connected by the Routing Matrix, which determines the order of state transitions. In Dlugosch’s paper, the maximum number of available Full-STEs was y = 49,152 and the size of the memory was 256 × 49,152 = 12,582,912 bits, or
