2 Full-STE's Problem in Snort Case

FPGA-Based Parallel Pattern Matching


parallelization. The VC709 has two external DRAMs (8 GB in total, DDR3) that can be read at more than 10 Gbps, but each read takes more than one clock cycle, so they are not useful for random accesses.

A further analysis shows that Full-STE is wasteful of memory when the state transition condition is not a character class but a single character. For example, consider a Full-STE that makes a state transition on the letter A: only the bit at address 41h is 1 according to the ASCII code, and the other 255 bits of the memory are all 0. Even character classes are often wasteful. For example, the character class \d means 0, 1, ..., 9, and it sets only 10 of the 256 bits to 1.

We examined how many single characters and character classes appear in the Snort rule set. The results are shown in Table 3: 63.7% of appearances were single characters. In other words, if we apply Dlugosch's AP to the Snort rule set, Full-STEs in which only 1 bit is set occupy 63.7% of the total. Dlugosch's research aimed to create a general-purpose AP, but when it is applied to the Snort rule set, memory usage is inefficient. This is what prevents parallelization.
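To make the waste concrete, a small Python sketch (ours, not part of the paper's toolchain) builds a Full-STE's 256-bit match vector and counts the set bits:

```python
# Sketch: how sparse a Full-STE's 256-bit match vector is for a single
# character versus the class \d.
def full_ste_vector(accepted):
    """256-bit vector: bit i = 1 iff symbol i triggers the transition."""
    bits = [0] * 256
    for ch in accepted:
        bits[ord(ch)] = 1
    return bits

vec_A = full_ste_vector("A")            # single character 'A' (0x41)
vec_d = full_ste_vector("0123456789")   # character class \d

print(sum(vec_A), vec_A.index(1))  # 1 bit set, at address 0x41 = 65
print(sum(vec_d))                  # only 10 of 256 bits set
```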

Table 3. Statistics about characters in Snort PCRE

                     # of appearances (percentage)   # of kinds (percentage)
  Single character    78,544 (63.7)                   123 (42.7)
  Character class     44,666 (36.3)                   165 (57.3)
  Total              123,210 (100.0)                  288 (100.0)

If we can eliminate the large memories of Full-STEs, this bottleneck might be removed and further parallelization achieved. In fact, combinational circuits that recognize all the single characters and character classes in the Snort rule set can be implemented.



Proposed Methods


Our method, Single-STE, shown in Fig. 2, upgrades the Decoder to a Character Class Recognizer instead of using memory. The Character Class Recognizer is a combinational circuit that judges whether an input symbol s corresponds to various characters or character classes and outputs the result as RE. When an input symbol s arrives, the Character Class Recognizer outputs RE(s), which goes into the logic part (AND gate) of each Single-STE through wiring like a matrix switch, shown in the top right of the figure. The rest, including the Routing Matrix, is the same as in Full-STE.

Unlike the Decoder, the Character Class Recognizer lets multiple REs be 1 at the same time. So that they do not interfere with each other, only one RE enters the lower AND gate of each Single-STE.
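Behaviorally, the Character Class Recognizer is a pure function from an input symbol to the RE outputs. The sketch below is our Python model (not the authors' Verilog), using the six recognizers of the running example:

```python
# Behavioral model of a Character Class Recognizer: a combinational
# function from an input symbol s to the Recognition Enable (RE) outputs,
# one per character or character class used in the rule set.
CLASSES = [
    ("o",               lambda s: s == "o"),
    ("r",               lambda s: s == "r"),
    ("[\\s\\x2f\\x2A]", lambda s: s in " \t\r\n\f\v/*"),
    ("+",               lambda s: s == "+"),
    ("1",               lambda s: s == "1"),
    ("=",               lambda s: s == "="),
]

def recognize(s):
    """RE(0)..RE(5): unlike a decoder, several bits may be 1 at once."""
    return [int(pred(s)) for _, pred in CLASSES]

print(recognize("/"))  # only RE(2) is 1
print(recognize("1"))  # only RE(4) is 1
```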

M. Fukuda and Y. Inoguchi

Fig. 2. Single-STE's architecture

The strength of Single-STE is that it eliminates memory. The weakness is that changing the rule set requires reconfiguring the Character Class Recognizer. If implementation as a semiconductor is premised, Full-STE has the advantage that it can cope with a changed rule set simply by rewriting memory. But Single-STE eliminates BRAM and can be highly parallelized in the case of an FPGA implementation.

For the example /or[\s\x2f\x2A]+1=1/, 7 Single-STEs are needed, just as with Full-STEs. The conditions under which the D-FFs and the Match signal are activated are also the same as with Full-STEs. In the Single-STE case, a Character Class Recognizer takes the place of the memory of the Full-STEs. Here, the Character Class Recognizer outputs RE(0), RE(1), ..., RE(5), which mean that the input symbol is o, r, [\s\x2f\x2A], +, 1 and =, respectively. The final 1 of the PCRE is represented by RE(4).

Above all, Single-STEs do not need any BRAM, so the bottleneck of parallelization is eliminated. Furthermore, the duplication of 1 in this PCRE has been removed, saving LUT resources. These contribute to a high degree of parallelism.
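The D-FF update rule of such a chain can be simulated behaviorally. The sketch below is our Python model, not the authors' circuit: it folds the repeated 1 into six states (the paper's circuit uses 7 Single-STEs) and gives the character-class state a self-loop to mimic the + repetition:

```python
# Our behavioral model of a Single-STE chain (not the authors' HDL).
# Each STE holds one D-FF; on a clock edge, STE i becomes 1 iff its RE
# fires for the current symbol AND its predecessor was active. STE 0 is
# always enabled; the character-class STE keeps itself alive for '+'.
REs = [
    lambda s: s == "o",
    lambda s: s == "r",
    lambda s: s in " \t\r\n\f\v/*",   # [\s\x2f\x2A]
    lambda s: s == "1",
    lambda s: s == "=",
    lambda s: s == "1",               # final 1: the hardware reuses RE(4)
]

def step(state, s, self_loops=frozenset({2})):
    nxt = []
    for i, re in enumerate(REs):
        enable = 1 if i == 0 else state[i - 1]
        if i in self_loops:
            enable = enable or state[i]
        nxt.append(int(re(s)) & int(bool(enable)))
    return nxt

def match(stream):
    state = [0] * len(REs)
    for s in stream:
        state = step(state, s)
    return state[-1]                  # Match signal

print(match("or/1=1"), match("or  *1=1"), match("oz/1=1"))  # 1 1 0
```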




Parallelization is done as in Fig. 3, which shows an example of two-way parallelization. The left-half circuit is exactly the same as the Single-STEs shown in Fig. 2, but the wiring in the Routing Matrix is changed. The right-half circuit is also almost the same as in Fig. 2, but the D-FF in each Single-STE is removed, so the circuit does not hold the current states. Its role is only to inform the left-half circuit of the result of the state transition.



The wiring of the Routing Matrix is as follows. The outputs of Single-STE(0), Single-STE(1) and Single-STE(2) in the left half are connected to the inputs of Single-STE'(1), Single-STE'(2) and Single-STE'(3) in the right half, respectively. Conversely, the outputs of Single-STE'(0), Single-STE'(1) and Single-STE'(2) in the right half are connected to the inputs of Single-STE(1), Single-STE(2) and Single-STE(3) in the left half, respectively. This wiring lets the left half process s1 and the right half process s2, the symbol just after s1. Finally, the Match signal is the OR of the outputs of Single-STE(3) in the left half and Single-STE'(3) in the right half.

Fig. 3. Parallelization of Single-STEs

More specifically, when a stream s = s1 s2 s3 ... is input, the first two symbols s1 and s2 enter the circuit of Fig. 3 in a single clock cycle. The first symbol s1 goes to the Character Class Recognizer of the left half in the figure, and the second symbol s2 to that of the right half. The REs (Recognition Enables) corresponding to s1 and s2 are then asserted and go to the logic parts of the STEs. The outputs of the STEs in the left half represent the next states after inputting s1, and the outputs of the STEs in the right half represent the next states after inputting s2 following s1. These go back to the left half and update the states of the D-FFs. All of this happens in one clock cycle because Single-STE'(0), Single-STE'(1), ... in the right half contain no D-FFs.

In the same way, a degree of parallelism of 3 or more is theoretically possible. When parallelizing with Full-STE, however, BRAM usage doubles and triples, which puts pressure on resources.
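The composition of two transitions within one clock can be sketched abstractly. The standalone Python model below (ours, with a toy 3-state chain for "abc" rather than the paper's circuit) shows that applying the transition function twice per cycle yields the same match result:

```python
# Sketch of two-way parallelization: if T(state, s) is the one-symbol
# transition of a Single-STE chain, the parallel circuit computes
# T(T(state, s1), s2) combinationally, because the second copy of the
# chain has no D-FFs; the D-FFs latch only once per clock cycle.
def make_chain(res):
    def t(state, s):
        # STE 0 is always enabled; STE i needs STE i-1's previous output
        return [int(res[0](s))] + [
            int(res[i](s)) & state[i - 1] for i in range(1, len(res))
        ]
    return t

T = make_chain([lambda s: s == "a", lambda s: s == "b", lambda s: s == "c"])

def run(stream, width):
    state = [0, 0, 0]
    for i in range(0, len(stream), width):
        for s in stream[i:i + width]:   # one clock cycle: 'width' transitions
            state = T(state, s)         # purely combinational when width > 1
        # the D-FFs latch 'state' here, once per cycle
    return state[-1]                    # Match for "abc"

print(run("xabc", 2))  # 1: same answer, half the clock cycles
```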


Automatic Conversion from PCRE to Verilog HDL

We developed a tool to automatically convert PCREs into Verilog HDL source files. The outline of the generation process is as follows; since this is not the subject of this paper, we do not go into details.



1. Remove PCREs including unsupported functions

2. Generate Character Class Recognizer

3. Generate Matrix Switch and Routing Matrix

Disjunctions, quantifiers, etc. are not yet supported at the moment, so PCREs including them are eliminated; at this point, the 7,898 rules were reduced to only 47. After that, we generate the Verilog HDL of the Character Class Recognizer, Matrix Switch and Routing Matrix. In the Full-STE case, a COE file representing the memory data is generated instead of a Character Class Recognizer, and then the Routing Matrix is generated.
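Step 1 can be approximated as a syntactic filter. The sketch below is our rough stand-in, not the authors' tool; exactly which constructs their generator rejects (beyond disjunction and general quantifiers) is our assumption:

```python
# Rough sketch of step 1: drop PCREs that use constructs the generator
# does not support, here assumed to be disjunction '|' and the
# quantifiers '*', '?' and '{m,n}'. (The running example keeps '+'.)
import re

UNSUPPORTED = re.compile(r"[|*?]|\{\d+(,\d*)?\}")

def supported(pcre):
    return UNSUPPORTED.search(pcre) is None

rules = ["/or[\\s\\x2f\\x2A]+1=1/", "/foo|bar/", "/ab{2,3}c/"]
print([r for r in rules if supported(r)])  # only the first rule survives
```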




Experimental Conditions

Synthesis and implementation were performed using Xilinx Vivado 2017.3, with the strategies Vivado Synthesis Default and Vivado Implementation Default. The board was the VC709 Connectivity Kit. In the constraint file, we set the clock period to 8 ns. The Snort rule set is the Registered edition.


Resource Usage

The resource usage reported by Vivado is shown in Tables 4 and 5, where p is the degree of parallelism. The target board is the VC709 Connectivity Kit, and the numbers of available LUTs, FFs and BRAMs are 433,200, 866,400 and 1,470, respectively.

Table 4. Resource usage of Full-STEs (LUTs, FFs and BRAMs for p = 1, 2, 4, 8, 16, 32; BRAM usage is 8.5 at p = 1)


Table 5. Resource usage of Single-STEs (LUTs and FFs for p = 1, 2, 4, 8, 16, 32; no BRAMs are used)


Figures 4 and 5 show the resource utilization ratios of Full-STE and Single-STE. BRAM is clearly the bottleneck of parallelization in Full-STE, while no BRAM is used in Single-STE. Single-STEs appear to be parallelizable up to 1,440 with LUTs as the bottleneck, whereas Full-STEs can be parallelized only up to 172 with BRAMs as the bottleneck.

Fig. 4. Resource utilization ratio of Full-STE (resource utilization [%] vs. degree of parallelism)

Fig. 5. Resource utilization ratio of Single-STE (resource utilization [%] vs. degree of parallelism)

These results indicate that the degree of parallelism of Single-STEs can be greatly improved compared to Full-STEs. With Full-STEs for 47 PCREs, the maximum degree of parallelism appears to be 172. In the case of Single-STEs, it is about 1,440, or 8.37 times that of Full-STEs.
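The bounds quoted above follow from simple arithmetic over the resources available on the VC709 and the per-lane usage at p = 32:

```python
# Checking the parallelism bounds quoted in the text: available resources
# divided by per-lane usage, where per-lane usage is the p = 32 figure
# divided by 32.
bram_avail, lut_avail = 1470, 433200
bram_per_lane = 272 / 32     # Full-STE BRAMs at p = 32, per parallel lane
lut_per_lane = 9625 / 32     # Single-STE LUTs at p = 32, per parallel lane

full_ste_max = bram_avail // bram_per_lane    # bottleneck: BRAM
single_ste_max = lut_avail // lut_per_lane    # bottleneck: LUT
print(full_ste_max, single_ste_max, single_ste_max / full_ste_max)
# 172 lanes vs 1440 lanes, a factor of about 8.37
```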


Timing Requirements

The timing analysis reported by Vivado is shown in Tables 6 and 7. WNS (Worst Negative Slack), WHS (Worst Hold Slack) and WPWS (Worst Pulse Width Slack) represent the margins for meeting the timing requirements: the WNS, WHS and WPWS correspond to the setup time, the hold time and the clock period of 8 ns, respectively. Parallelization naturally decreases these slacks because the critical path becomes longer with the wiring delay of the Routing Matrix, but plenty of slack remains even at a degree of parallelism of 32. Since the non-parallel circuit processes one 8-bit input symbol per 8 ns, at a degree of parallelism of 32 the throughput of the whole circuit is 32 Gbps.
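The throughput figure follows directly from the symbol width and clock period:

```python
# Throughput arithmetic from the text: one 8-bit symbol per 8 ns lane,
# scaled by the degree of parallelism p. bits/ns equals Gbps.
def throughput_gbps(p, bits_per_symbol=8, period_ns=8.0):
    return p * bits_per_symbol / period_ns

print(throughput_gbps(1))   # 1.0 Gbps for the non-parallel circuit
print(throughput_gbps(32))  # 32.0 Gbps at p = 32
```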



Table 6. Slack of Full-STE

            p = 1   p = 2   p = 4   p = 8   p = 16  p = 32
WNS [ns]    5.479   6.006   5.512   3.654   3.095
WHS [ns]    0.103   0.085   0.027   0.233   0.118
WPWS [ns]   3.600   3.600   3.600   3.600   3.600



Table 7. Slack of Single-STEs

            p = 1   p = 2   p = 4   p = 8   p = 16  p = 32
WNS [ns]    6.823   5.911   5.884   4.209   4.759
WHS [ns]    0.090   0.084   0.085   0.111   0.111
WPWS [ns]   3.358   3.600   3.600   3.600   3.600



Conclusion

In this paper, we presented an FPGA implementation to accelerate the pattern matching of the PCREs used in Snort (an NIDS). In the Full-STE method, the use of on-chip memory was wasteful and was the bottleneck of parallelization. As an improvement, we proposed Single-STE, which uses a Character Class Recognizer instead of memory; by using no memory for the conditions of state transitions, it eliminates the bottleneck of parallelization. Although the Character Class Recognizer in Single-STE requires more LUTs than Full-STE, the potential degree of parallelism is still higher on a high-end device such as the Virtex-7. Our evaluation shows that the degree of parallelism of Full-STE was 1,470/(272/32) = 172 at most with 47 PCREs. On the other hand, that of Single-STE could be 433,200/(9,625/32) = 1,440, or 8.37 times that of Full-STE. Future work will support more Snort rules and evaluate them. As in previous work [8], disjunction and quantifiers should be supported, and back references are also desired.


References

1. Snort. http://www.snort.org/

2. LAN/MAN Standards Committee of the IEEE Computer Society: IEEE Standard

for Ethernet (2015)

3. Hieu, T.T., Thinh, T.N., Vu, T.H.: Optimization of regular expression processing

circuits for NIDS on FPGA. In: Proceedings of Second International Conference

on Networking and Computing, pp. 105–112 (2011)

4. Pu, S., Tan, C.-C., Liu, J.-C.: SA2PX: a tool to translate SpamAssassin regular expression rules to POSIX. In: Proceedings of 6th Conference on Email and Anti-Spam, pp. 1–10 (2009)

5. Yang, Y.-H., Prasanna, V.K.: Space-time tradeoff in regular expression matching

with semi-deterministic finite automata. In: Proceedings of IEEE INFOCOM, pp.

1853–1861 (2011)



6. Zu, Y., Yang, M., Xu, Z., Wang, L., Tian, X., Peng, K., Dong, Q.: GPU-based

NFA implementation for memory efficient high speed regular expression matching.

ACM SIGPLAN Not. 47(8), 129–140 (2012)

7. Fukuda, M., Inoguchi, Y.: Probabilistic strategies based on staged LSH for speedup

of audio fingerprint searching with ten million scale database. In: Proceedings

of International Symposium on Highly-Efficient Accelerators and Reconfigurable

Technologies (2017)

8. Dlugosch, P., Brown, D., Glendenning, P., Leventhal, M., Noyes, H.: An efficient

and scalable semiconductor architecture for parallel automata processing. IEEE

Trans. Parallel Distrib. Syst. 25(12), 3088–3098 (2014)

9. Roy, I., Srivastava, A., Nourian, M., Becchi, M., Aluru, S.: High performance pattern matching using the automata processor. In: IEEE Parallel and Distributed

Processing Symposium, pp. 1123–1132 (2016)

10. Cronin, B., Wang, X.: Hardware acceleration of regular expression repetitions in

deep packet inspection. Inst. Eng. Technol. Inf. Secur. 7(4), 327–335 (2013)

Embedded Vision Systems:

A Review of the Literature

Deepayan Bhowmik(B) and Kofi Appiah

Department of Computing, Sheffield Hallam University, Sheffield S1 1WB, UK


Abstract. Over the past two decades, the use of low-power Field Programmable Gate Arrays (FPGAs) for the acceleration of various vision systems, mainly on embedded devices, has become widespread. The reconfigurable and parallel nature of the FPGA opens up new opportunities to speed up computationally intensive vision and neural algorithms on embedded and portable devices. This paper presents a comprehensive review of embedded vision algorithms and applications over the past decade. The review discusses vision-based systems and approaches, and how they have been implemented on embedded devices. Topics covered include image acquisition, preprocessing, object detection and tracking, and recognition as well as high-level classification. This is followed by an outline of the advantages and disadvantages of the various embedded implementations. Finally, an overview of the challenges in the field and future research trends is presented. This review is expected to serve as a tutorial and reference source for embedded computer vision systems.



Scene understanding and prompt reaction to an event is a critical feature for any time-critical computer vision system. Deployment scenarios include a range of applications such as mobile robotics, autonomous cars, mobile and wearable devices, and public space surveillance (airports/railway stations). Modern vision systems which play a significant role in such interaction processes require higher-level scene understanding with ultra-fast processing capabilities operating at extremely low power. Currently, such systems rely on traditional computer vision techniques which often follow compute-intensive brute-force approaches (slower response times) and are prone to failure in environments with limited power, bandwidth and computing resources. The aim of this paper is to review state-of-the-art embedded vision systems available from the literature and in industry, and thereby to aid researchers in future development.

Research into computer vision has made steady and significant progress in the past two decades. This tremendous progress, coupled with cheap computational power, has enabled many portable and embedded devices to operate with vision capabilities. Digital Signal Processing, and for that matter Digital Image Processing (DIP), is an exciting area to be involved in today. Having been around for over two decades, it is typically used in application areas where cost and performance are key [7], including the entertainment industry, security surveillance systems, medical systems, the automotive industry and defence. DIP systems are often implemented using ubiquitous general purpose processors (GPPs).

© Springer International Publishing AG, part of Springer Nature 2018
N. Voros et al. (Eds.): ARC 2018, LNCS 10824, pp. 204–216, 2018.

The increasing demand for high speed has resulted in the use of dedicated Digital Signal Processors (DSPs) and General Purpose Graphics Processing Units (GPGPUs), special types of GPP optimised for signal processing algorithms. However, power dissipation is important in almost all DSP-based consumer electronic devices, so high-speed, power-hungry GPPs become unattractive. Battery-powered products are highly sensitive to energy consumption, and even line-powered products are often sensitive to power consumption [41]. For hardware acceleration and low power consumption, DIP designers have opted for alternatives like the Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuits (ASICs).

The use of FPGAs in application areas like communication, image processing and control engineering has increased significantly over the past decade [54]. Computer vision and image processing algorithms often perform a large number of inherently parallel operations, and are not good candidates for implementation on machines designed around the von Neumann architecture. Some image processing algorithms have successfully been implemented on embedded system architectures running in real time on portable devices [35,45], and a relatively small body of literature has been dedicated to the development of high-level algorithms for embedded hardware [39,63]. The demand for real-time processing in the design of any practical imaging system has led to the development of the Intel open-source Computer Vision library (OpenCV) for the acceleration of various image processing tasks on GPPs [46]. Many imaging systems rely heavily on the increasing processing speed of today's GPPs to run in real time.


Application Specific Vision Systems

Every embedded vision system follows a common pipeline of image processing functional blocks, as depicted in Fig. 1. The image sensor or camera is the starting point of this pipeline, followed by a frame grabber that controls frame synchronization and frame rate. The raw pixels are then passed on for further processing, which includes image pre-processing, feature extraction and classification. Within this higher-level abstraction, various vision systems implement the required functionalities as shown in the figure. Image preprocessing functions are often pixel operations and lend themselves to stream computation. However, feature extraction and classification tasks are complex in nature and usually involve non-deterministic loop conditions. Analysis and optimisation [59] of such complexity with respect to performance and power [13] is an emerging topic of interest, often seen as a trade-off that includes the choice of hardware.

Fig. 1. Vision system pipeline.
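The pipeline of Fig. 1 can be sketched as composable stages. The concrete operations below (thresholding, a foreground-pixel count, a fixed-threshold classifier) are our illustrative stand-ins, not algorithms from the reviewed literature:

```python
# Illustrative sketch of the Fig. 1 pipeline as composable stages:
# acquisition/frame grab -> preprocessing -> feature extraction ->
# classification, over a tiny stand-in frame.
def preprocess(frame):
    # pixel-wise stage (stream computation), e.g. simple thresholding
    return [[1 if px > 127 else 0 for px in row] for row in frame]

def extract_features(frame):
    # toy feature: count of foreground pixels; real extractors are
    # data-dependent, hence the non-deterministic loops noted above
    return sum(sum(row) for row in frame)

def classify(feature, threshold=2):
    return "object" if feature >= threshold else "background"

frame = [[200, 10], [180, 255]]          # stand-in for a grabbed frame
print(classify(extract_features(preprocess(frame))))  # "object"
```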

Embedded vision systems are usually developed either to accelerate complex algorithms that handle large streams of image data, e.g., stereo matching, video compression, etc., or to minimize power in resource-constrained systems such as unmanned aerial vehicles (UAVs) or autonomous driver assistance systems. While a large number of applications of embedded vision systems can be found in the literature, they can be grouped into major application areas including robotics, face detection, multimedia compression, autonomous driving and assisted living, as shown in Table 1. Various implementation techniques proposed in the literature consider a range of image processing algorithms. Efforts were made either to parallelize the algorithms or to use approximate computing to reduce computational complexity. While the first approach improves performance, the latter is more suitable for low-power applications. Popular high-level complex image processing algorithms that are used in embedded computer vision

Table 1. Embedded vision application areas. UAV: unmanned aerial vehicle; AUV: autonomous underwater vehicle. (Columns: UAV, mobile robot, AUV, media compression, autonomous driving, assisted living. Entries include Cesetti et al. [15], Humenberger et al. [31], Yang et al. [70], Chen et al. [17], Velez et al. [65], Yang et al. [69], Lin et al. [42], Oleynikova et al. [50], Flores et al. [22], Xu and Shen [68], Wang and Yu [67], Abeydeera et al. [1], He et al. [28], Basha and Kannan [10].)




literature include stereo vision, feature extraction and tracking, motion estimation, object detection, scene segmentation and, more recently, convolutional neural networks (CNNs). These categories and the corresponding literature are captured in Table 2.

Table 2. Common high-level algorithms used in embedded vision systems. (Columns: feature point, stereo, motion estimation, object detection, scene segmentation, CNN. Entries include Park et al. [51], Belbachir et al. [11], Banz et al. [8], Cesetti et al. [15], Humenberger et al. [32], Lin et al. [42], Oleynikova et al. [50], Flores et al. [22], Ttofis et al. [64], Jin et al. [36], Chen et al. [17], He et al. [28], Basha and Kannan [10], Liu et al. [43], Zhao et al. [74].)





Embedded Vision Systems

Central Processing Unit (CPU)

The widespread adoption of imaging and vision applications in industrial automation, robotics and surveillance calls for a better way of implementing such techniques for real-time purposes. The need to fill the knowledge gap for students who have studied either computer vision or microelectronics, so that they can fill industry positions requiring both kinds of expertise, has been addressed with the introduction of various CPU-based platforms like the Beagleboard [47] and Raspberry Pi [48]. Hashmi et al. [27] used a BeagleBoard-xM, a low-power open-source hardware platform, to prototype a real-time copyright protection algorithm. A human tracking system which reliably detects and tracks human motion has been implemented on a BeagleBoard-xM [24]. In [5], a LeopardBoard was used to implement an efficient edge-detection algorithm for tracking activity levels in an indoor environment. Similarly, Sharma and Kumar [56] presented an image enhancement algorithm on a BeagleBoard, mainly for monitoring the health condition of an individual. To demonstrate the efficiency of embedded image processing, Sahani and Mohanty [55] showcased various computer vision
