
2 α-Closed Frequent Subtree Mining



Mining Frequent Subtrees in Glycan Data Using the Rings Glycan Miner Tool


Fig. 1. An illustration of a disaccharide structure, where Galactose is linked to Glucose via

a glycosidic linkage in a beta 1–4 configuration. In this case, the “parent” would be

considered the glucose, and the “child” would be the galactose.

descendant of n. A parent of a node n is defined as the closest

ancestor of n, and a child is the closest descendant. A leaf is a

node that does not have any children. Internal nodes are nodes

that are neither leaves nor the root. An ordered tree is a tree containing nodes whose children have an ordering.

In the context of glycobiology, glycans are regarded as rooted,

labeled, ordered trees, which are hereafter simply called trees. A tree

S is called a subtree of tree T if S consists of nodes and edges which

form a connected (and rooted) subset of the nodes and edges of T.

Conversely, T can then be called a supertree of S. The size of a tree T

refers to the number of edges in T, denoted by |T|. An immediate

supertree T of S satisfies |T| = |S| + 1.
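The definitions above can be made concrete with a minimal Python sketch. The `Node` class, the `size` function, and the `Fuc` label are illustrative assumptions and not part of the Glycan Miner Tool; only the Glc/Gal pair follows Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node of a rooted, labeled, ordered tree (hypothetical helper)."""
    label: str                                             # e.g. a monosaccharide name
    children: List["Node"] = field(default_factory=list)  # order is significant


def size(tree: Node) -> int:
    """Size of a tree as defined in the text: its number of edges."""
    return sum(1 + size(child) for child in tree.children)


# Disaccharide of Fig. 1: glucose is the parent, galactose the child.
disaccharide = Node("Glc", [Node("Gal")])

# Attaching one more node (the label is an arbitrary example) adds exactly
# one edge, so this tree is an immediate supertree of the disaccharide.
trisaccharide = Node("Glc", [Node("Gal", [Node("Fuc")])])
```

Here `size(trisaccharide)` equals `size(disaccharide) + 1`, which is the immediate-supertree condition |T| = |S| + 1.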

2.3. Glycan Miner Tool

To prepare input for the Glycan Miner Tool, we next describe the text

format called KCF, which is used to describe glycan structures. This

format is currently required to specify the input glycan data.

The KEGG Chemical Function format, or KCF, was first described

for use in the KEGG GLYCAN database (8). It uses a graph

concept to describe glycan structures, where nodes represent

monosaccharides and edges represent glycosidic bonds. An example of the tri-mannose N,N′-diacetyl chitobiose core structure of

N-glycans, usually depicted as in Fig. 2, is described in KCF format

as follows:


K.F. Aoki-Kinoshita

Fig. 2. The tri-mannose N,N′-diacetyl chitobiose core structure of N-glycans.

The KCF file format must contain the NODE and EDGE

sections as a minimum. The ENTRY line is used to denote the

name or ID of the represented structure, if any. In this example, the

glycan ID is G00311. The NODE section starts with a number

indicating the number of residues being represented. The same

number of rows follows this line, each containing the following

information: residue number, residue name, and x- and y-coordinates for drawing the structure. The EDGE section follows

similarly with a number indicating the number of glycosidic bonds

in the structure, which will usually be one less than the number of

residues. The same number of rows follows, with each row containing the following information: bond number, node numbers of

residues being bound, anomeric configuration, and hydroxyl

groups in the bond. For example, the following row

3 4:a1 3:6

represents the third bond, which indicates that node number 4

(Man at position −15,5) is bound to node number 3 (Man at

position −5,0) in an α1–6 configuration.
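As a rough illustration of the layout just described, the following Python sketch parses a small KCF record. The sample listing is a reconstruction consistent with the description above (entry G00311, nodes 3 and 4, and bond 3 are taken from the text); the remaining rows, the trailing `Glycan` tag on the ENTRY line, and the `///` terminator are assumptions. The names `parse_kcf`, `Residue`, and `Bond` are hypothetical, and the parser assumes that the residue carrying the anomeric configuration is written first, as in the row quoted above.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Residue(NamedTuple):
    name: str
    x: float
    y: float


class Bond(NamedTuple):
    child: int      # node written first, carrying the anomeric configuration
    anomer: str     # e.g. "a1" or "b1"
    parent: int     # node written second
    hydroxyl: str   # hydroxyl position on the parent, e.g. "6"


# Illustrative record only; not copied from the database.
SAMPLE_KCF = """\
ENTRY       G00311                      Glycan
NODE        5
            1   GlcNAc     15     0
            2   GlcNAc      5     0
            3   Man        -5     0
            4   Man       -15     5
            5   Man       -15    -5
EDGE        4
            1     2:b1     1:4
            2     3:b1     2:4
            3     4:a1     3:6
            4     5:a1     3:3
///
"""


def parse_kcf(text: str) -> Tuple[Optional[str], Dict[int, Residue], Dict[int, Bond]]:
    """Parse the ENTRY, NODE, and EDGE sections of a single KCF record."""
    entry: Optional[str] = None
    nodes: Dict[int, Residue] = {}
    bonds: Dict[int, Bond] = {}
    section = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "///":
            continue
        if fields[0] in ("ENTRY", "NODE", "EDGE"):
            section = fields[0]
            if section == "ENTRY" and len(fields) > 1:
                entry = fields[1]
            continue  # the NODE/EDGE counts are implied by the rows that follow
        if section == "NODE":
            num, name, x, y = fields[:4]
            nodes[int(num)] = Residue(name, float(x), float(y))
        elif section == "EDGE":
            num, left, right = fields[:3]
            child, anomer = left.split(":")
            parent, hydroxyl = right.split(":")
            bonds[int(num)] = Bond(int(child), anomer, int(parent), hydroxyl)
    return entry, nodes, bonds


entry, nodes, bonds = parse_kcf(SAMPLE_KCF)
```

With this sample, `bonds[3]` links node 4 to node 3 with anomer `a1` and hydroxyl `6`, matching the α1–6 bond discussed in the text, and the number of bonds is one less than the number of residues.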

The Resource for INformatics of Glycomes at Soka (RINGS)

Web site (Akune 2010) provides a convenient drawing tool whereby

users can draw a glycan structure and immediately obtain its KCF

format. RINGS also has several format conversion utilities such that

data that may already be in a particular format can be easily converted


to KCF format. Conversely, there are tools that can translate from

KCF to other formats, such as LinearCode® (9), and to image files.

2.4. Glycan Databases

2.4.1. KEGG GLYCAN

The KEGG GLYCAN database, available at http://www.genome.jp/kegg/glycan/,

is a database of glycan structures accumulated from the original CarbBank database (8). It has since been

refined and updated with structures from the literature. All the data

is freely available from the Web and can also be accessed via an

application programming interface (API), which is described at


2.4.2. Consortium for Functional Glycomics

The Consortium for Functional Glycomics (CFG) glycan structure

database was originally developed as a part of the bioinformatics core

of the CFG. In addition to their own database of glycan-binding

proteins (GBP) and glycosyltransferases, they had initially accumulated N- and O-linked glycans from the CarbBank database as well as

the glycan structure data from GlycoMinds, Ltd. Since then, they have

added their own synthesized glycans from their glycan array library

and characterized glycans from their tissue and cell profiling data.

All of the glycan profiling data, glycan array data, and knockout

mouse data generated by the CFG are also available as data

resources over the Web. Glycan profiling data consist of the mass

spectral data for various human and mouse tissue samples, which

have been annotated using Cartoonist. The glycan array data consist of binding affinity information of various glycans for different

GBP, viruses, bacteria, etc. Each data set focuses on a particular

GBP or other binder and lists the binding affinity for each glycan

structure on the array. Glycan arrays have developed over the years,

and the latest version contains over 600 glycan structures (10).

2.4.3. GlycomeDB

In addition to KEGG GLYCAN and CFG, there are several glycan

structure databases available around the world. GlycomeDB was

developed to serve as a portal to all of these structure databases via a

single interface. It was a major project that entailed the integration

of all the structures in the various databases, which was complicated

mostly due to the inconsistent naming conventions of monosaccharides. As a result, this database was able to integrate seven

different glycan databases and includes over 30,000 distinct glycan

structures. The URL is http://www.glycome-db.org/ (11).

3. Methods

3.1. α-Closed Frequent Subtree Mining

Given a set of trees D, the support of a subtree S is the number of

trees in D that contain S as a subtree, which we denote by support

(S). A frequent subtree is thus a subtree whose support is larger

than or equal to a given threshold value denoted by minsup.

Fig. 3. The Glycan Miner Tool where default values have been supplied.

A maximal frequent subtree is a frequent subtree, none of

whose proper supertrees are frequent. A subtree is closed if none of

its proper supertrees has the same support as it has. With the

concept of maximal and closed frequent subtrees defined, we can

then formally define an α-closed frequent subtree as a frequent

subtree S, none of whose frequent supertrees has support greater

than or equal to a fraction α of support(S).
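The relationship between these three notions can be checked on a toy example. The sketch below is not the mining algorithm of (1); it only filters patterns whose support values and supertree relations are already known. The function name `alpha_closed`, the string pattern names, and the support values are assumptions chosen for illustration, and minsup is treated here as an absolute count, following the definition of support given above.

```python
from typing import Dict, List, Set


def alpha_closed(support: Dict[str, int],
                 supertrees: Dict[str, List[str]],
                 minsup: int,
                 alpha: float) -> Set[str]:
    """Return the frequent patterns S for which no frequent proper supertree T
    satisfies support(T) >= alpha * support(S)."""
    frequent = {s for s, count in support.items() if count >= minsup}
    result = set()
    for s in frequent:
        blocked = any(
            t in frequent and support[t] >= alpha * support[s]
            for t in supertrees.get(s, [])
        )
        if not blocked:
            result.add(s)
    return result


# Toy chain of patterns A < AB < ABC, each a proper supertree of the previous.
support = {"A": 10, "AB": 10, "ABC": 4}
supertrees = {"A": ["AB", "ABC"], "AB": ["ABC"], "ABC": []}

closed = alpha_closed(support, supertrees, minsup=3, alpha=1.0)
maximal = alpha_closed(support, supertrees, minsup=3, alpha=0.0)
half = alpha_closed(support, supertrees, minsup=3, alpha=0.5)
```

On this data, alpha = 1 yields the closed frequent patterns (AB and ABC, since A has a supertree with the same support), while alpha = 0 yields only the maximal frequent pattern ABC. Intermediate values of alpha interpolate between the two: at alpha = 0.5, AB is kept because its supertree ABC has support 4, which is below half of 10.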

Given a data set of trees, the mining of α-closed frequent

subtrees from within this data set entails the enumeration of all

possible subtrees and then determining their support values. This is

in fact a difficult problem because the number of possible

subtrees can grow exponentially. However, by considering the

support of supertrees as subtrees are enumerated, potential subtrees can be pruned so that frequent subtrees can be found efficiently. Because the details of this method are beyond the scope of

this chapter, interested readers are referred to (1).

3.2. Glycan Miner Tool

The Glycan Miner Tool is available in the RINGS resource at

http://www.rings.t.soka.ac.jp/ (12). For registered users, any

uploaded data is automatically saved in the user data space, which is

password protected. Thus all executions of any program on RINGS

can be recorded and managed by the user.

The input to the Glycan Miner Tool is a list of glycan structures

in KCF format. There are also two parameters, minsup and alpha,

each of which takes a value between zero and one. A screenshot of the

Glycan Miner Tool where default values have been supplied is

illustrated in Fig. 3. Given these inputs, the tool then computes
