
Automated Detection for Probable Homologous Foodborne Disease Outbreaks

Xiao Xiao¹, Yong Ge², Yunchang Guo³, Danhuai Guo¹, Yi Shen¹, Yuanchun Zhou¹, and Jianhui Li¹(✉)

¹ Computer Network Information Center, Chinese Academy of Sciences, Beijing, China

² University of North Carolina at Charlotte, Charlotte, USA

³ Division of Foodborne Disease Surveillance, China National Center for Food Safety Risk Assessment, Beijing, China


Abstract. Foodborne disease, a rapidly growing public health problem, has become the highest-priority topic in food safety. The threat of foodborne disease has stimulated interest in enhancing public health surveillance to detect outbreaks rapidly. To advance research on food risk assessment in China, the China National Center for Food Safety Risk Assessment (CFSA) sponsored a project, beginning in October 2012, to construct an online correlation analysis system for foodborne disease surveillance. CFSA collects foodborne disease clinical data from sentinel hospitals across the country, with the aim of analyzing the foodborne disease outbreaks present in the collected data and ultimately finding the link between pathogens, incriminated food sources and infected persons. Rapid detection of outbreaks is a critical first step for this analysis. The purpose of this paper is to provide approaches that can be applied in an online system to rapidly find local and sporadic foodborne disease outbreaks in the collected data. Specifically, we employ DBSCAN for local outbreak detection and address the problem of adaptively selecting its parameters. We also propose a new approach named K-CPS (K-Means Clustering with Pattern Similarity) to detect sporadic outbreaks. The experimental results show that our methods are effective for rapidly mining local and sporadic outbreaks from the dataset.

Keywords: Foodborne disease outbreak detection · Clustering · Parameters self-adaptive · Frequent patterns

This work is partly supported by Special Research Funding of National Health

and Family Planning Commission of China under grant No.201302005, Natural

Science Foundation of China under Grant No. 41371386, 91224006, the Strategic

Priority Research Program of the Chinese Academy of Sciences under Grant No.

XDA06010307, XDA05050601, 12th Five-Year Plan for Science & Technology Support under Grant No.2013BAD15B02.

© Springer International Publishing Switzerland 2015

T. Cao et al. (Eds.): PAKDD 2015, Part I, LNAI 9077, pp. 563-575, 2015.

DOI: 10.1007/978-3-319-18038-0_44



X. Xiao et al.


The threat of foodborne disease has stimulated interest in public health surveillance [1][2]. In theory, every foodborne disease case links a pathogen, an incriminated food source and an infected person, and finding that link is crucial for foodborne disease surveillance. Analysis of foodborne outbreak data is one approach to finding the link, and it can estimate the proportion of human cases of specific enteric diseases attributable to a specific food item. For example, [3] employed multiple correspondence analysis (MCA) to explore the relationship between micro-organism, region and food vehicle. The analysis of foodborne outbreak data is known as food attribution and is an important tool in food safety risk analysis [4][5]. To advance research on food risk assessment in China, CFSA sponsored a project, beginning in October 2012, to construct an online correlation analysis system for foodborne disease surveillance. CFSA collects foodborne disease clinical data from sentinel hospitals and wants to analyze the foodborne disease outbreaks present in the collected data and ultimately find the link between pathogens, incriminated food sources and infected persons. A primary purpose of the project is to detect problems in food and water production and delivery systems that might otherwise go unnoticed. Rapid detection of outbreaks is a critical first step in abating these active hazards and preventing their recurrence, but rapidly finding the outbreaks in the data remains an open problem for CFSA. The purpose of this paper is to provide approaches that can be applied in the online system to rapidly find the local and sporadic foodborne disease outbreaks in the collected data.

Some research has focused on disease outbreak detection. When an epidemic sweeps through a region or a foodborne outbreak emerges, the number of hospital visits is strongly perturbed, so anomaly detection approaches have been used to detect disease outbreaks from changes in the number of cases. These methods require a baseline, which can be derived from historical data. [6] employs Fisher's exact test to examine whether a rule observed today is abnormal given its historical occurrences. [7] develops a simple randomization-based framework to recognize significant increases in event counts. In addition, intense spatial aggregation is often observed in disease outbreaks. [8] presents a fast multi-resolution method to detect significant spatial disease clusters: given a grid of squares, where each square has a count and an underlying population, the goal is to find the square region with the highest density and to calculate its significance by randomization. [9] improves on [8] by using a novel overlap-kd tree data structure to reduce the time needed to find the spatial disease clusters. [10] introduces a fast spatial scan algorithm that generalizes the 2D scan algorithm of [9] to arbitrary dimensions. The work above on cluster detection is purely spatial, but time is an essential component of most disease cluster problems. Methods also exist for detecting emerging space-time disease clusters: [11] proposes a new class of spatio-temporal cluster detection methods designed for the rapid detection of space-time clusters of disease cases resulting from an emerging disease outbreak.



Although many approaches have been proposed for disease outbreak detection, they are unsatisfactory for the problem we want to solve. In our scenario, data collection started in October 2012, so the available data are very limited and there are not enough historical data to estimate a baseline. In fact, for foodborne disease outbreak detection, the method employed depends on what data are available. Buckeridge et al. give a practical classification of outbreak detection algorithms by considering the types of information encountered in surveillance analysis [12]. In our situation, we have a database of clinical cases from 615 sentinel hospitals in 34 provinces, municipalities or autonomous regions across the country. Each record in this database contains information about an individual who has seen a doctor.

When many people contract a foodborne disease within a short time in nearby locations, we call it a local foodborne disease outbreak (LFDO); if the locations are not limited to a small area, we call it a sporadic foodborne disease outbreak (SFDO). In this paper, we employ DBSCAN, a density-based algorithm for discovering clusters in large spatial databases with noise [13], to detect LFDOs, and we address the problem of adaptively selecting its parameters. We also propose a new approach to detect SFDOs.

The rest of this paper is organized as follows. After a description of the

dataset used in this paper in Section 2, we detail the detection approaches

in Section 3. Then in the following section, we present our experiments. In

Section 5, we give an analysis and a discussion of our experiment results. Finally,

in Section 6 we conclude our work and provide an outlook of the future work.


2 Data Collection

Diarrhea is the most common symptom of foodborne illness, and foodborne diarrhea provides one of the strongest signals for food safety. In October 2012, CFSA started to collect information from diarrheal patients who visit the sentinel hospitals. The rules of data collection are: a) all collected cases are diarrheal cases, but not all diarrheal cases seen at the sentinel hospitals are collected; b) only cases with diarrhea three or more times per day and abnormal stool character are recorded; c) each sentinel hospital is required to collect at least 10 cases per week. This data collection strategy still needs improvement, and the way data are recorded currently limits detection. Under this recording strategy, the number of cases does not reflect the true disease occurrence, even though a change in the number of cases would normally be a significant signal of an outbreak. Thus, for this dataset, we cannot use methods that detect outbreaks from the number of cases.

Each record includes information about the individual, with fields such as age, gender, career, symptoms exhibited, home location, disease time, and whether samples (anal swab and stool) were collected. Some of these records also have incriminated food information, which includes food name, food brand, manufacturer, place of purchase, place of eating, time of eating, and whether the food was sampled. In this paper, we mainly use home location, disease time and symptoms exhibited to detect probable local homologous foodborne disease outbreaks. The incriminated food information is mainly used to preliminarily verify whether the detected outbreak clusters are homologous. For sporadic outbreaks, we use the symptoms exhibited combined with the food name. Note that the preliminary verification produced by our method is not completely reliable. It only provides a possible clue for researchers, who verify the results through professional analysis, using a molecular typing system, of the pathogens (such as Salmonella, Shigella and Sapovirus) found in the patient samples (such as anal swab and stool).


3 Approaches for Local and Sporadic Outbreaks


An outbreak of foodborne disease is defined as an incident in which a group of people consume the same contaminated food and two or more of them come down with the same illness. Based on this definition, we hypothesize that patients in an outbreak caused by the same contaminated source will exhibit the same or similar symptoms. In practice, patients in an LFDO are not distributed randomly, and the temporal and spatial clustering is obvious, so we use disease time, home location and symptoms exhibited as features for LFDO clustering. For sporadic outbreaks, we want to cluster cases that have common symptoms and similar food information, so we use symptoms combined with the corresponding food information as features.


3.1 LFDO Detection

Data Preprocessing. We use cases collected between 1 January 2013 and 16 January 2014 for local outbreak detection (we name this Dataset 1). The raw disease time is in a time format and we convert it to a long integer. The raw home location is a textual address, and we use the Google Geocoding API to obtain the longitude and latitude of each address. We then use the Gauss-Krüger projection [14] to convert spherical coordinates to plane coordinates, with longitude and latitude corresponding to the x and y axes, respectively. We separate the raw symptom text into symptom terms. In addition, we divide the frequency of diarrhea (for example, 5 times per day) into four grades: Low (0-3), Medium (4-6), High (7-9) and Ultrahigh (10 or more). We also divide the body temperature into the same four grades: Low (37 °C-37.9 °C), Medium (38 °C-38.9 °C), High (39 °C-39.9 °C) and Ultrahigh (40 °C or above). Since all the cases contain diarrhea, we eliminate it as a stop word for each case. Then, we generate a 0-1 vector for each symptom description. Specifically, we use the symptom terms that occur in the cases of a dataset as features; the value of a feature is 1 if the case contains the symptom term and 0 otherwise. Fig. 1(a) gives a simple example of the process of generating vectors. Finally, we combine the processed disease time, home location and symptoms into a vector as the input for DBSCAN. Fig. 1(b) is an example of a combined vector. Note that the combined vectors need to be normalized and that the weights of these three different features should be adjusted.
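The grading and 0-1 encoding steps described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the project: the function names, the example cases, and the choice to return no grade for temperatures below 37 °C are assumptions made here.

```python
def grade_diarrhea(times_per_day):
    """Map a daily diarrhea frequency to the four grades given in the text."""
    if times_per_day <= 3:
        return "Low"
    if times_per_day <= 6:
        return "Medium"
    if times_per_day <= 9:
        return "High"
    return "Ultrahigh"


def grade_temperature(celsius):
    """Map a body temperature in degrees Celsius to a grade.

    Assumption: temperatures below 37 are not graded (None), since the
    text only defines grades from 37 upwards.
    """
    if celsius < 37.0:
        return None
    if celsius < 38.0:
        return "Low"
    if celsius < 39.0:
        return "Medium"
    if celsius < 40.0:
        return "High"
    return "Ultrahigh"


def symptom_vectors(cases, stop_words=("diarrhea",)):
    """Build 0-1 vectors over all symptom terms that occur in the cases.

    cases: list of sets of symptom terms. Terms in stop_words are dropped
    from the feature list. Features are sorted only to make the order
    reproducible.
    """
    features = sorted({t for case in cases for t in case if t not in stop_words})
    vectors = [[1 if f in case else 0 for f in features] for case in cases]
    return features, vectors


# Hypothetical cases, loosely modelled on Fig. 1(a).
cases = [
    {"abdominal pain", "diarrhea", "vomiting"},
    {"fatigue", "headaches", "loss of appetite"},
    {"fatigue", "fever"},
]
features, vectors = symptom_vectors(cases)
```

Each row of `vectors` would then be concatenated with the converted disease time and the projected coordinates to form the combined vector.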

Method Description. DBSCAN is a density-based algorithm that discovers clusters of arbitrary shape with a minimal number of input parameters. The input parameters are the neighborhood radius (Eps) and the minimum number of points required within that radius (Minpts). Considering the characteristics of both DBSCAN and LFDOs, DBSCAN has the following advantages for LFDO detection: a) local disease clusters have arbitrary shapes; b) the parameter Minpts enables users to flexibly detect clusters of different sizes as required, for example, users can set Minpts to 2 according to the definition of a foodborne disease outbreak; c) local outbreaks may have different densities because of the different population densities of regions, and the parameter Eps enables users to detect clusters of different densities.
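To make the roles of Eps and Minpts concrete, the following is a compact, unoptimized sketch of DBSCAN in plain Python. It is written for illustration only: it uses Euclidean distance and a brute-force neighborhood query, whereas a production system would normally rely on a spatial index or an existing library implementation. The names `region_query` and `dbscan` and the example points are assumptions.

```python
import math


def region_query(points, idx, eps):
    """Indices of all points within distance eps of points[idx] (idx included)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


# Hypothetical 2-D points: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=2)
```

With Minpts set to 2, as suggested by the outbreak definition, any two cases closer than Eps already form a cluster, which is why the choice of Eps matters so much in this setting.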



[Figure 1(a): example cases Case 1 = {abdominal pain, …}, Case 2 = {fatigue, headaches, loss of appetite}, Case 3 = {fatigue, fever}; Features = {abdominal pain, diarrhea, vomiting, fever, fatigue, headaches, loss of appetite}. Figure 1(b): a combined vector over the same features.]

(a) The process of generating 0-1 vectors for symptoms

(b) An example of a combined vector

Fig. 1. The process of generating vectors for cases

The research on DBSCAN within the project focuses on the practical issues of applying existing algorithms to foodborne outbreak detection rather than on developing new algorithms. One major problem arises when DBSCAN is applied to foodborne disease clustering: how to automatically find a proper Eps for a specific dataset. Ester et al. proposed a simple and effective heuristic to determine a desired Eps [13]. The heuristic defines a function k-dist from a dataset D to the real numbers, mapping each point to the distance from its k-th nearest neighbor. Sorting the points of D in ascending order of their k-dist values gives some hints about the density distribution of D. The threshold point is the first point in the first "valley" of the sorted k-dist graph (see Fig. 2), and the corresponding k-dist value is the desired Eps. The paper also proposed an interactive approach for determining the threshold point, based on the realistic assumption that a user can easily see the valley in a graphical representation. We want to reduce user participation and provide an easy-to-use approach that can be integrated into the online correlation analysis system for foodborne disease surveillance. This paper therefore proposes an adaptive approach to determine the threshold point for a specific dataset without user participation.


In Fig. 2, the ideal threshold point is the one for which the points before it are normal points and the points after it are noise. Taking the 3-dist graph as an example, we find the threshold point T as illustrated in Fig. 3. We take the first point P1 and the last point P2 of the 3-dist graph to determine a line L, and we take the point of the graph that has the maximum distance to L as the threshold point. Connecting all the points into a curve, the threshold point T divides the curve into two parts: the slopes of the curve in the left part (Part I) are smaller than the slope of L, while those in the right part (Part II) are larger. Intuitively, the 3-dist values in Part I increase slowly while the 3-dist values in Part II increase rapidly. We regard the points in Part II as noise, because the 3-dist values of normal points are small and almost the same, while the 3-dist values of noise points are large and vary a lot. This method is intuitive, simple and effective.
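The threshold selection described above can be sketched in a few lines of Python. The sketch assumes that the x-coordinate of each point in the k-dist graph is simply its rank in the sorted order and that no rescaling of the axes is applied; the text does not specify either choice, so both are assumptions, as are the function names.

```python
import math


def k_dist_values(points, k):
    """Ascending list of distances from each point to its k-th nearest neighbor."""
    values = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        values.append(dists[k - 1])
    return sorted(values)


def adaptive_eps(sorted_k_dist):
    """Return (index, value) of the point farthest from the line P1P2.

    P1 is the first and P2 the last point of the sorted k-dist graph.
    Requires at least two values.
    """
    n = len(sorted_k_dist)
    x1, y1 = 0, sorted_k_dist[0]
    x2, y2 = n - 1, sorted_k_dist[-1]
    norm = math.hypot(x2 - x1, y2 - y1)
    best_index, best_distance = 0, -1.0
    for x, y in enumerate(sorted_k_dist):
        # Perpendicular distance from (x, y) to the line through P1 and P2.
        distance = abs((y2 - y1) * x - (x2 - x1) * y + x2 * y1 - y2 * x1) / norm
        if distance > best_distance:
            best_index, best_distance = x, distance
    return best_index, sorted_k_dist[best_index]


# Hypothetical sorted 3-dist values: a flat part followed by a sharp rise.
threshold_index, eps = adaptive_eps([1.0, 1.1, 1.2, 1.3, 1.4, 9.0])
```

In this toy example the selected value is the last one of the flat part, so points whose k-dist exceeds it would be treated as noise.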








Fig. 2. The k-dist graph of Dataset 1

Fig. 3. How to find the desired point (curve regions labeled Part I and Part II)

3.2 SFDO Detection

For sporadic outbreaks, the temporal and spatial clustering is not obvious, and the symptoms exhibited and the food information are the only useful signals for detection. We describe a new approach to sporadic outbreak clustering based on pattern similarity, which produces clusters that are easier to interpret and use. The approach is motivated by the following observation: an outbreak happens when two or more people become ill because of the same contaminated food source, so the cases in a cluster may all contain the same frequent symptom pattern and involve the same food. Since standard clustering algorithms have no knowledge of these patterns, we propose and evaluate a new clustering algorithm for symptom clustering, K-Means Clustering with Pattern Similarity (K-CPS). We use cases collected between 1 January 2013 and 9 June 2014 for sporadic outbreak detection (we name this Dataset 2). We separate the raw symptom text into symptom terms and divide the frequency of diarrhea and the temperature into four grades, as for Dataset 1. The term "diarrhea" is again eliminated as a stop word for each case. The food name of most cases contains only one term. In the few cases where the food name contains two or more terms, we treat them as a single term.

Basic Concept of Frequent Patterns. We briefly review some standard definitions of frequent pattern mining, which is a necessary and important step in mining association rules [15]. Let I = {i1, i2, ..., im} be a set of items. Let T be a set of transactions, where each transaction t is a set of items, t ⊆ I. The support of an item-set X, X ⊆ I, is the fraction of transactions that contain X. If the support is above a user-specified minimum, then we say that X is a frequent pattern.
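As a small worked illustration of the support definition, the sketch below computes the support of an item-set and enumerates all frequent item-sets by brute force. It is meant for tiny examples only; practical miners such as Apriori or FP-growth avoid enumerating every candidate. The function names and transactions are hypothetical, and treating "above the minimum" as "greater than or equal to" is an assumption.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_patterns(transactions, min_support):
    """Map each frequent item-set to its support (brute-force enumeration)."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:  # assumption: the minimum itself counts as frequent
                result[frozenset(candidate)] = s
    return result


# Hypothetical transactions: each case is a set of symptom and food terms.
transactions = [
    frozenset({"fever", "vomiting", "watermelon"}),
    frozenset({"fever", "watermelon"}),
    frozenset({"fever", "milk"}),
    frozenset({"nausea", "milk"}),
]
patterns = frequent_patterns(transactions, min_support=0.5)
```

Here the pattern {fever, watermelon} has support 0.5 because it occurs in two of the four cases.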


K-CPS: K-Means Clustering with Pattern Similarity. In this subsection,

we describe the details of K-CPS algorithm. First, Algorithm 1 shows the pseudocode for K-CPS.

Algorithm 1. K-CPS Algorithm

Input: a dataset D; a minimum support threshold α; a denoising threshold β
Output: clustering result CR

Phase I
    FP ← frequent_pattern_miner(α, D)             ▷ mine the frequent patterns (FP) of D
    CFP ← closed_frequent_pattern(FP)             ▷ select the closed frequent patterns
    SCFP ← screen_closed_frequent_pattern(CFP)    ▷ select the patterns that contain both symptoms and food
    #MaxFP ← maximal_frequent_pattern(SCFP)       ▷ select the maximal frequent patterns MaxFP from SCFP; #MaxFP is their number
Phase II
    for i = D_1 → D_m do
        for j = SCFP_1 → SCFP_n do
            if D_i contains all the terms in SCFP_j then
                Similarity_ij ← d_JAS(D_i, SCFP_j) = |D_i ∩ SCFP_j| / |D_i ∪ SCFP_j| = |SCFP_j| / |D_i|
            else
                Similarity_ij ← 0
            end if
        end for
    end for
    D ← Denoising(β, D)                           ▷ remove the cases whose similarities are all less than or equal to β
    CR ← Cluster(#MaxFP, D)                       ▷ run WEKA simple K-means on the processed dataset D
    return CR

K-CPS consists of two phases. In the first phase, K-CPS computes the closed frequent patterns. In the second phase, it computes the similarity between the frequent patterns and the objects in Dataset 2. We define the similarity using Jaccard similarity [16], which is the quotient between the intersection and the union of the pairwise compared variables of two objects. Equation (1) gives the Jaccard similarity between object X and object Y.

$$d_{JAS}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|} \qquad (1)$$

After the computation, we obtain an n-dimensional feature vector for each object, where n is the number of frequent patterns. The processed data are then clustered using simple K-Means, so that K-CPS assigns all the objects that have similar symptom-food patterns to the same cluster.
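The second phase of K-CPS, up to the point where K-Means takes over, can be illustrated with the sketch below. It builds the pattern-similarity feature vector of each case and applies the denoising step. The sketch uses exact matching of terms, and the function names, patterns and cases are hypothetical; the final clustering step is omitted.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets, as in Equation (1)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pattern_similarity_vector(case, patterns):
    """One feature per pattern: Jaccard similarity if the case contains
    every term of the pattern, otherwise 0."""
    case = set(case)
    return [jaccard(case, p) if set(p) <= case else 0.0 for p in patterns]


def denoise(vectors, beta):
    """Indices of the cases that have at least one similarity greater than beta."""
    return [i for i, v in enumerate(vectors) if any(s > beta for s in v)]


# Hypothetical symptom-food patterns and cases.
patterns = [{"abdominal pain", "watermelon"}, {"fever", "milk"}]
cases = [
    {"abdominal pain", "nausea", "watermelon"},
    {"fever", "milk"},
    {"headaches", "rice"},
]
vectors = [pattern_similarity_vector(c, patterns) for c in cases]
kept = denoise(vectors, beta=0.0)
```

Because a matching case contains the whole pattern, the similarity reduces to the pattern size divided by the case size, so the first case scores 2/3 on the first pattern. The third case matches no pattern and is dropped before clustering.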


4 Experimental Evaluation

In this section, we present an experimental evaluation of the parameter-adaptive DBSCAN and K-CPS algorithms. We use Dataset 1 and Dataset 2 for local and sporadic outbreak detection, respectively. Some characteristics of these two datasets are shown in Table 1. For Dataset 1, we use cases both with and without food information for clustering, and we use the cases with food information to evaluate the effectiveness of the local outbreak detection. For Dataset 2, we use the cases with food information for sporadic outbreak detection. We have no training set or human-annotated data, so we cannot use the common evaluation measures for clustering, such as purity, Rand index (RI) and F-measure [17]. Instead, we associate the clustering results with food category. The local outbreaks are described statistically by disease time, home location, symptoms and food category; the sporadic outbreaks are described statistically by symptoms and food category. If the cases with the same cluster label all relate to the same food category, then the cluster is a probable foodborne disease outbreak. The probable outbreaks we find are evidence of the usefulness of our algorithms.

Table 1. Some characteristics of experimental data sets

Data set     #cases   Time span                           #cases with food
Dataset 1    77829    1 January 2013 - 16 January 2014
Dataset 2    91599    1 January 2013 - 9 June 2014        33435





4.1 The Clustering Effect of Adaptive DBSCAN

In this experiment, we use the approach in Section 3.1 to find an appropriate Eps for every dataset. There are several implementation details. First, time, location and symptoms are three different types of data, so we use normalization to bring them into the same reference frame: we normalize the data in every dimension to the interval [0, 1]. Second, since symptoms are 0-1 vectors, the difference caused by a single differing symptom term between two cases is much larger than the differences in time and location, so that time and location would have no effect on the clustering result. We therefore reduce the weight of symptoms. In practice, we set the weight of each symptom dimension to 1 × 10^-7, a heuristic weight derived from experiments and adjustment. Third, these three features are not equally important in a local foodborne disease outbreak, and we use a rank-order weighting method [18] to derive a weight for each feature. This reduces our task to ranking the features by importance, which is easier and more reliable than specifying exact values. Specifically, we hold that the order of importance in local outbreaks is time > location > symptoms. Based on this order, we employ the rank-order centroid (ROC) method [18][19][20] to compute weights for the three features. Equation (2), where w_k is the weight of the k-th ranked feature, gives the weights for n features.

$$w_k(\mathrm{ROC}) = \frac{1}{n} \sum_{i=k}^{n} \frac{1}{i}, \qquad k = 1, 2, \ldots, n \qquad (2)$$

According to Equation (2), the weights of time, location and symptoms are 11/18, 5/18 and 2/18, respectively. The location has two dimensions, so the weight of each dimension is 1/2 × 5/18, and the same applies to symptoms: if the symptoms contain m dimensions, the weight of each dimension is 1/m × 2/18.

Last but not least, DBSCAN uses a global Eps for a dataset, but local outbreaks in different provinces may have different densities because of the different population densities and case densities of the provinces, so a global Eps cannot suit all provinces. To solve this problem, we split Dataset 1 by province and run DBSCAN on each subset separately. We take the data of Anhui, Gansu, Guangxi, Henan, Hubei, Jiangsu, Jiangxi, Sichuan, Yunnan and Zhejiang for the experiments; these 10 provinces have the most cases. Table 2 shows some statistics of the experimental results. The probable outbreaks are hand-marked by an expert who has experience in foodborne disease surveillance. The hand-marking is based on four criteria: a) whether the disease times and locations are close to each other; b) whether the symptoms exhibited are similar; c) whether the cases relate to the same incriminated food; d) whether the patients are infected by the same bacteria (only very few cases have bacteria information). The experimental results show that our method is promising. With the adaptive Eps, DBSCAN can effectively find the probable local outbreaks in the data. Rapid detection of outbreaks is the critical first step for foodborne disease surveillance.
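The rank-order centroid weights used for the three features can be checked with a short computation. The sketch below evaluates the standard ROC formula, w_k = (1/n) Σ_{i=k}^{n} 1/i, with exact fractions and then splits each feature weight evenly over its dimensions. The function names and the example dimension counts are assumptions.

```python
from fractions import Fraction


def roc_weights(n):
    """Rank-order centroid weights for n ranked features (most important first)."""
    return [sum(Fraction(1, i) for i in range(k, n + 1)) / n for k in range(1, n + 1)]


def per_dimension_weights(feature_dims):
    """Split each feature's ROC weight evenly over its dimensions.

    feature_dims lists the number of dimensions of each feature in
    ranked order, e.g. [1, 2, m] for time, location and m symptom terms.
    """
    weights = roc_weights(len(feature_dims))
    return [w / d for w, d in zip(weights, feature_dims)]


weights = roc_weights(3)                       # time, location, symptoms
dim_weights = per_dimension_weights([1, 2, 4])  # hypothetical m = 4 symptom terms
```

For three features this gives 11/18, 5/18 and 2/18 (the last printed by Python in lowest terms as 1/9), and the weights always sum to 1.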


Table 2. Statistics of the local outbreak detection results

Columns: #total cases, #total clusters, #probable outbreaks, #cases in outbreaks, incriminated food

Incriminated food entries:
kelp, roast, milk, sprouts
milk, noodle
rice, pork, mushroom, beans,
wild mushroom, breast milk
soybean milk, pork, milk
rice soup, milk
preserved egg, porridge, spiced crispy duck
wild mushroom, grape
cake, fish, seafood, duck intestines, watermelon, banquet food

"#total cases" is the number of clinical cases collected; "#total clusters" is the number of clusters; "#probable outbreaks" is the number of clusters that are probable outbreaks hand-marked by an expert in CFSA; "#cases in outbreaks" is the number of cases in probable outbreaks.

4.2 The Clustering Effect of K-CPS

We compare the clustering results of K-CPS and WEKA simple K-means on Dataset 2 to show the effect of K-CPS on symptom-food clustering. We use the 33435 cases whose food information is not null. As further preprocessing, we delete the cases with unclear food information, such as the term "unknown". This leaves 21898 cases for the experiments, each containing a symptom description and a kind of food. There are some implementation details to note. First, determining the parameter k is a hard algorithmic problem [21][22]. In K-CPS, we set k to the number of maximal frequent patterns, based on the assumption that each frequent pattern represents a specific class. Second, in simple K-means, we only use as features the terms (symptoms and food) whose number of occurrences is greater than γ, where γ = α × 21898, and we generate a 0-1 vector for each case in the same way as illustrated in Fig. 1. The parameter k is set to the same value as in K-CPS. Third, we use "contains" rather than "equals" when deciding whether a case contains a specified symptom term or food. For example, suppose a case C1 consists of the terms abdominal pain, nausea and frozen watermelon, and a frequent pattern FP1 consists of the terms abdominal pain and watermelon. We consider that C1 contains FP1. Accordingly, in simple K-means, we consider that C1 contains the term watermelon. Fig. 4 shows the mean entropy of food at different support thresholds for K-CPS and simple K-means. As shown in Fig. 4, K-CPS performs well at clustering cases with the same food together while balancing the similarity of symptoms. Since a foodborne outbreak is defined as two or more people becoming ill after consuming the same contaminated food, K-CPS is more reasonable than simple K-means for sporadic foodborne outbreak detection: it can find the probable sporadic outbreaks in the data.
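The food entropy used to compare the two algorithms can be illustrated as follows. The text does not give the exact formula, so this sketch makes two assumptions: entropy is the Shannon entropy (base 2) of the food labels inside a cluster, and the mean is an unweighted average over clusters. The function names and example clusters are hypothetical.

```python
import math
from collections import Counter


def food_entropy(foods):
    """Shannon entropy (base 2) of the food labels in one cluster."""
    counts = Counter(foods)
    total = len(foods)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_food_entropy(clusters):
    """Unweighted mean of the per-cluster food entropies."""
    return sum(food_entropy(c) for c in clusters) / len(clusters)


# Hypothetical clusters, each given as the list of foods of its cases.
pure_cluster = ["milk", "milk"]
mixed_cluster = ["milk", "rice"]
score = mean_food_entropy([pure_cluster, mixed_cluster])
```

A cluster in which every case names the same food has entropy 0, so a lower mean entropy indicates that cases sharing a food are grouped together more consistently.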

Fig. 4. The mean entropy of food at different support thresholds of K-CPS and simple K-means



Based on the analyses above and the definition of a foodborne disease outbreak, we examine the characteristics of LFDOs and SFDOs, which helps in choosing proper algorithms to detect the outbreaks present in the data.

(1) In LFDO detection, the patient's home location and disease time are the most useful signals of an outbreak. The cases of an LFDO show obvious spatiotemporal aggregation. Compared with the time-space features, the symptoms exhibited are less significant in clustering, so a weighting strategy is needed to reduce their weight.

(2) In SFDO detection, the time-space features are no longer indicative of an outbreak. As a result, we have to make the most of the symptoms exhibited. However, because of the diversity among individuals, even two people infected by the same bacteria may exhibit different symptoms, so we take food information into account at the same time, which makes the detected outbreaks more reliable. By comparing the experimental results with the infecting bacteria (at present, very few cases have bacteria information), we found that patients who have very similar symptoms and consumed the same type of food are very likely to share a homologous infection.

Note that our method focuses on probable outbreaks only. We hope to provide effective and rapid screening of the raw collected data for experts working on disease surveillance and food safety. Our experimental results indicate that much work remains to be done, and our work is a critical first step.



The detection of foodborne disease outbreak is important for food safety and

is a complicated task at the same time. The contribution of this paper is to
