Mining Frequent Subtrees in Glycan Data Using the Rings Glycan Miner Tool
Fig. 1. An illustration of a disaccharide structure, where Galactose is linked to Glucose via
a glycosidic linkage in a beta 1–4 configuration. In this case, the “parent” would be
considered the glucose, and the “child” would be the galactose.
descendant of n. A parent of a node n is defined as the closest
ancestor of n, and a child is the closest descendant. A leaf is a
node that does not have any children. Internal nodes are nodes
that are neither leaves nor the root. An ordered tree is a tree containing nodes whose children have an ordering.
In the context of glycobiology, glycans are regarded as rooted,
labeled, ordered trees, which are hereafter simply called trees. A tree
S is called a subtree of tree T if S consists of nodes and edges which
form a connected (and rooted) subset of the nodes and edges of T.
Conversely, T can then be called a supertree of S. The size of a tree T
refers to the number of edges in T, denoted by |T|. An immediate
supertree T of S satisfies |T| = |S| + 1.
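The definitions above can be made concrete with a minimal Python sketch. The `Node` class and the helper names are hypothetical and chosen only for illustration; they are not part of the Glycan Miner Tool. The residue label `Fuc` in the second tree is likewise arbitrary.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a rooted, labeled, ordered tree (e.g. a monosaccharide)."""
    label: str
    children: List["Node"] = field(default_factory=list)  # list order = child order


def node_count(node: Node) -> int:
    """Number of nodes in the tree rooted at this node."""
    return 1 + sum(node_count(child) for child in node.children)


def size(root: Node) -> int:
    """Size as defined in the text: the number of edges (nodes minus one)."""
    return node_count(root) - 1


def leaves(node: Node) -> List[str]:
    """Labels of all nodes without children, in left-to-right order."""
    if not node.children:
        return [node.label]
    return [label for child in node.children for label in leaves(child)]


# Fig. 1: glucose is the parent (root), galactose is its only child.
lactose = Node("Glc", [Node("Gal")])

# Attaching one more node yields an immediate supertree: |T| = |S| + 1.
bigger = Node("Glc", [Node("Gal", [Node("Fuc")])])

print(size(lactose), size(bigger), leaves(bigger))
```

Because size counts edges rather than nodes, a single monosaccharide has size 0 and the disaccharide of Fig. 1 has size 1.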
2.3. Glycan Miner Tool
In order to use the Glycan Miner Tool, we next describe the text
format called KCF, which is used to describe glycan structures. This
format is currently required to specify the input glycan data.
KEGG Chemical Function format, or KCF, was first described
for use in the KEGG GLYCAN database (8). It uses a graph
concept to describe glycan structures, where nodes represent
monosaccharides and edges represent glycosidic bonds. An example of the tri-mannose N,N′-diacetyl chitobiose core structure of
N-glycans, usually depicted as in Fig. 2, is described in KCF format
Fig. 2. The tri-mannose N,N′-diacetyl chitobiose core structure of N-glycans.
The KCF file format must contain the NODE and EDGE
sections as a minimum. The ENTRY line is used to denote the
name or ID of the represented structure, if any. In this example, the
glycan ID is G00311. The NODE section starts with a number
indicating the number of residues being represented. The same
number of rows follows this line, each containing the following
information: residue number, residue name, and x- and y-coordinates for drawing the structure. The EDGE section follows
similarly with a number indicating the number of glycosidic bonds
in the structure, which will usually be one less than the number of
residues. The same number of rows follows, with each row containing the following information: bond number, node numbers of
residues being bound, anomeric configuration, and hydroxyl
groups in the bond. For example, the following row
3 4:a1 3:6
represents the third bond, which indicates that node number 4
(Man at position −15,5) is bound to node number 3 (Man at
position −5,0) in an α1–6 configuration.
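As a rough illustration of the layout just described, the following Python sketch parses a single KCF entry. The sample listing is an illustrative reconstruction: only the ID G00311, the positions of nodes 3 and 4, and the third EDGE row come from the text, while the remaining rows and coordinates are assumptions and are not copied from KEGG GLYCAN. The function `parse_kcf` is a hypothetical helper, not part of RINGS, and treating the first residue in an EDGE row as the child is an assumption in line with Fig. 1. It expects well-formed rows in which the anomeric configuration is always given.

```python
SAMPLE_KCF = """\
ENTRY       G00311                      Glycan
NODE        5
            1   GlcNAc     15     0
            2   GlcNAc      5     0
            3   Man        -5     0
            4   Man       -15     5
            5   Man       -15    -5
EDGE        4
            1   2:b1   1:4
            2   3:b1   2:4
            3   4:a1   3:6
            4   5:a1   3:3
///
"""


def parse_kcf(text):
    """Parse one KCF entry into a dict with 'entry', 'nodes' and 'edges'."""
    glycan = {"entry": None, "nodes": {}, "edges": []}
    section = None
    for line in text.splitlines():
        if not line.strip() or line.startswith("///"):
            continue  # skip blank lines and the end-of-entry marker
        fields = line.split()
        if not line[0].isspace():
            # A keyword in column 0 (ENTRY, NODE, EDGE) opens a new section.
            section = fields[0]
            if section == "ENTRY" and len(fields) > 1:
                glycan["entry"] = fields[1]
            continue
        if section == "NODE":
            number, name, x, y = fields[:4]
            glycan["nodes"][int(number)] = (name, float(x), float(y))
        elif section == "EDGE":
            number, first, second = fields[:3]
            child, child_end = first.split(":")      # e.g. "4" and "a1"
            parent, parent_end = second.split(":")   # e.g. "3" and "6"
            glycan["edges"].append({
                "bond": int(number),
                "child": int(child),
                "anomer": child_end[0],              # "a" or "b"
                "child_carbon": int(child_end[1:]),
                "parent": int(parent),
                "parent_carbon": int(parent_end),
            })
    return glycan


core = parse_kcf(SAMPLE_KCF)
third = core["edges"][2]
print(core["entry"], core["nodes"][third["child"]], core["nodes"][third["parent"]])
```

The counts that follow the NODE and EDGE keywords are ignored here; a stricter reader could compare them with the number of rows actually parsed.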
The Resource for INformatics of Glycomes at Soka (RINGS)
Web site (Akune 2010) provides a convenient drawing tool whereby
users can draw a glycan structure and immediately obtain its KCF
format. RINGS also has several format conversion utilities such that
data that may already be in a particular format can be easily converted
to KCF format. Conversely, there are tools that can translate from
KCF to other formats, such as LinearCode® (9) and image files.
2.4. Glycan Databases
2.4.1. KEGG GLYCAN
The KEGG GLYCAN database, available at http://www.genome.jp/kegg/glycan/,
is a database of glycan structures accumulated from the original CarbBank database (8). It has since been
refined and updated with structures from the literature. All the data
is freely available from the Web and can also be accessed via an
application programming interface (API), which is described at
2.4.2. Consortium for Functional Glycomics
The Consortium for Functional Glycomics (CFG) glycan structure
database was originally developed as a part of the bioinformatics core
of the CFG. In addition to their own database of glycan-binding
proteins (GBP) and glycosyltransferases, they had initially accumulated N- and O-linked glycans from the CarbBank database as well as
the glycan structure data from GlycoMinds, Ltd. Since then, they have
added their own synthesized glycans from their glycan array library
and characterized glycans from their tissue and cell profiling data.
All of the glycan profiling data, glycan array data, and knockout
mouse data generated by the CFG are also available as data
resources over the Web. Glycan profiling data consist of the mass
spectral data for various human and mouse tissue samples, which
have been annotated using Cartoonist. The glycan array data consist of binding affinity information of various glycans for different
GBP, viruses, bacteria, etc. Each data set focuses on a particular
GBP or other binder and lists the binding affinity for each glycan
structure on the array. Glycan arrays have developed over the years,
and the latest version contains over 600 glycan structures (10).
In addition to KEGG GLYCAN and CFG, there are several glycan
structure databases available around the world. GlycomeDB was
developed to serve as a portal to all of these structure databases via a
single interface. It was a major project that entailed the integration
of all the structures in the various databases, which was complicated
mostly by the inconsistent naming conventions of monosaccharides. As a result, this database was able to integrate seven
different glycan databases and includes over 30,000 distinct glycan
structures. The URL is http://www.glycome-db.org/ (11).
3.1. α-Closed Frequent Subtrees
Given a set of trees D, the support of a subtree S is the number of
trees in D that contain S as a subtree, which we denote by
support(S). A frequent subtree is thus a subtree whose support is
greater than or equal to a given threshold value denoted by minsup.
A maximal frequent subtree is a frequent subtree none of whose
proper supertrees are frequent. A subtree is closed if none of
its proper supertrees has the same support as it has. With the
concepts of maximal and closed frequent subtrees defined, we can
then formally define an α-closed frequent subtree as a frequent
subtree S, none of whose frequent supertrees has support greater
than or equal to a fraction α of support(S).
Fig. 3. The Glycan Miner Tool where default values have been supplied.
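The definitions of support and of the closure condition can be checked on a toy data set with the Python sketch below. It is a brute-force illustration under stated assumptions, not the algorithm of the Glycan Miner Tool: trees are nested tuples, linkage information on edges is ignored, the children of a pattern must match an order-preserving subsequence of the children in the data tree, minsup is an absolute count, and the candidate patterns are supplied by hand rather than enumerated. All function names are hypothetical.

```python
def T(label, *children):
    """Build a rooted, ordered, labeled tree as a hashable nested tuple."""
    return (label, tuple(children))


def embeds_at(s, t):
    """True if pattern s occurs in t with the root of s mapped to the root of t."""
    if s[0] != t[0]:
        return False
    remaining = iter(t[1])
    # Match the children of s, in order, to a subsequence of the children of t.
    return all(any(embeds_at(sc, tc) for tc in remaining) for sc in s[1])


def contains(t, s):
    """True if s is a subtree of t, rooted at any node of t."""
    return embeds_at(s, t) or any(contains(child, s) for child in t[1])


def support(s, dataset):
    """Number of trees in the data set that contain s as a subtree."""
    return sum(1 for t in dataset if contains(t, s))


def alpha_closed(candidates, dataset, minsup, alpha):
    """Frequent candidates with no frequent proper supertree whose support
    is at least alpha * support(S)."""
    sup = {s: support(s, dataset) for s in candidates}
    frequent = [s for s in candidates if sup[s] >= minsup]
    kept = []
    for s in frequent:
        supertrees = [p for p in frequent if p != s and contains(p, s)]
        if not any(sup[p] >= alpha * sup[s] for p in supertrees):
            kept.append(s)
    return kept


# Toy data set of three trees and four hand-picked candidate patterns.
a = T("GlcNAc")
b = T("GlcNAc", T("Man"))
c = T("GlcNAc", T("Man", T("Man")))
d = T("GlcNAc", T("Man", T("Man"), T("Man")))
dataset = [d, c, b]
candidates = [a, b, c, d]

print([support(x, dataset) for x in candidates])
print(alpha_closed(candidates, dataset, minsup=2, alpha=1.0))
print(alpha_closed(candidates, dataset, minsup=2, alpha=0.6))
```

Under this reading of the definition, alpha = 1 reduces to the ordinary closed frequent subtrees, and lowering alpha discards more patterns until only the maximal frequent subtrees remain.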
Given a data set of trees, mining the α-closed frequent
subtrees in this data set entails enumerating all
possible subtrees and determining their support values. This is
a difficult problem because the number of possible
subtrees can grow exponentially with tree size. However, by considering the
support of supertrees as subtrees are enumerated, candidate subtrees can be pruned so that frequent subtrees can be found efficiently. Because the details of this method are beyond the scope of
this chapter, interested readers are referred to (1).
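To give a feel for the growth in the number of subtrees, the Python sketch below counts the connected subtrees of a tree by position, ignoring labels and ordering. This is a simplification made for illustration and is not the enumeration or pruning scheme of (1). It relies on the fact that a subtree rooted at a node either omits each child or includes one of the subtrees rooted at that child. The function names are hypothetical.

```python
def rooted_subtree_count(children):
    """Number of connected subtrees whose root is this node.

    A tree is given as the list of its root's child subtrees; a leaf is [].
    Each child is either left out (1 way) or contributes one of its own
    rooted subtrees, so the choices multiply across children.
    """
    count = 1
    for child in children:
        count *= 1 + rooted_subtree_count(child)
    return count


def total_subtree_count(children):
    """Number of connected subtrees rooted at any node of the tree."""
    return rooted_subtree_count(children) + sum(
        total_subtree_count(child) for child in children
    )


def perfect_binary(depth):
    """Perfect binary tree with the given number of edge levels."""
    if depth == 0:
        return []
    return [perfect_binary(depth - 1), perfect_binary(depth - 1)]


star = [[] for _ in range(10)]  # one root with ten leaf children

for depth in range(1, 5):
    print(depth, rooted_subtree_count(perfect_binary(depth)))
print(rooted_subtree_count(star))
```

A root with k leaf children already has 2^k rooted subtrees, which is why exhaustive enumeration without pruning quickly becomes impractical.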
3.2. Glycan Miner Tool
The Glycan Miner Tool is available in the RINGS resource at
http://www.rings.t.soka.ac.jp/ (12). For registered users, any
uploaded data are automatically saved in the user data space, which is
password protected. Thus all executions of any program on RINGS
can be recorded and managed by the user.
The input to the Glycan Miner Tool is a list of glycan structures
in KCF format. There are also two parameters, minsup and alpha,
each of which takes a value between zero and one. A screenshot of the
Glycan Miner Tool where default values have been supplied is
illustrated in Fig. 3. Given these inputs, the tool then computes