
Chapter 3. Topics Analysis of Large-Scale Web Dataset




consist of a vowel plus a semivowel. There are two of these semivowels: /j/ (written i or y) and /w/ (written o or u). A majority of diphthongs in Vietnamese are formed this way. Furthermore, these semivowels may also follow the first three diphthongs (“ia”, “ua”, “ưa”), resulting in triphthongs.

b. Tones

Vietnamese vowels are all pronounced with an inherent tone. Tones differ in pitch, length, melodic contour, intensity, and glottalization (with or without accompanying constricted vocal cords).

Tone is indicated by diacritics written above or below the vowel (most of the tone diacritics appear above the vowel; however, the dot diacritic of the “nặng” tone goes below the vowel). The six tones in Vietnamese are:

Table 3.2. Tones in Vietnamese



c. Consonants

The consonants of the Hanoi variety are written in the Vietnamese orthography, except for the bilabial approximant, which is transcribed here as “w” (in the writing system it is written the same as the vowels “o” and “u”). Some consonant sounds are written with a single letter (like “p”), others with a two-letter digraph (like “ph”), and still others with more than one letter or digraph (the velar stop is written variously as “c”, “k”, or “q”).




Table 3.3. Consonants of the Hanoi variety



3.1.2. Syllable Structure

Syllables are elementary units that have a single way of pronunciation. In documents, they are usually delimited by white space. Despite being elementary units, Vietnamese syllables are not indivisible elements but have an internal structure. Table 3.4 depicts the general structure of a Vietnamese syllable:

Table 3.4. Structure of Vietnamese syllables



                      TONE MARK
    First Consonant | Rhyme
                    | Secondary Vowel | Main Vowel | Last Consonant

Generally, a Vietnamese syllable has up to five parts: first consonant, secondary vowel, main vowel, last consonant, and a tone mark. For instance, the syllable “tuần” (week) has a tone mark (grave accent), a first consonant (t), a secondary vowel (u), a main vowel (â), and a last consonant (n). However, except for the main vowel, which is required in all syllables, the other parts may be absent. For example, the syllable “anh” (brother) has no tone mark, no secondary vowel, and no first consonant. Another example is the syllable “hoa” (flower), which has a secondary vowel (o) but no last consonant.

3.1.3. Vietnamese Word

Vietnamese is often erroneously considered a "monosyllabic" language. It is true that Vietnamese has many words that consist of only one syllable; however, most words in fact contain more than one syllable.

Based on the way words are constructed from syllables, we can classify them into three classes: single words, complex words, and reduplicative words. A single word has only one syllable that carries a specific meaning, for example: “tôi” (I), “bạn” (you), “nhà” (house), etc. Words that involve more than one syllable are called complex words. The syllables in complex words are combined based on semantic relationships, which are either coordinated (“bơi lội” – swim) or “principal and accessory” (“đường sắt” – railway). A word is considered reduplicative if its syllables have phonic components (Table 3.4) reduplicated, for instance: “đùng đùng” (fully reduplicative), “lung linh” (first consonant reduplicated), etc. This type of word is usually used for descriptions of scenes or sounds, particularly in literature.



3.2. Preprocessing and Transformation

Data preprocessing and transformation are necessary steps for any data mining process in general and for hidden topic mining in particular. After these steps, the data is clean, complete, reduced, partially free of noise, and ready to be mined. The main steps of our preprocessing and transformation are described in the subsequent sections and shown in the following chart:



Figure 3.1. Pipeline of Data Preprocessing and Transformation



3.2.1. Sentence Segmentation

Sentence segmentation is the task of determining whether a ‘sentence delimiter’ is really a sentence boundary. As in English, sentence delimiters in Vietnamese are the full stop, the exclamation mark, and the question mark (. ! ?). The exclamation mark and the question mark do not really pose problems. The critical element is again the period: (1) the period can be a sentence-ending character (full stop); (2) the period can denote an abbreviation; (3) the period can be used in expressions like URLs, emails, numbers, etc.; (4) in some cases, a period can assume both functions (1) and (2) at once.

Given an input string, the output of this detector is a set of sentences, each on one line. This output is then passed to the sentence tokenization step.
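The four period cases above can be sketched as a small rule-based detector. This is an illustrative assumption, not the thesis's actual implementation; the abbreviation list is a made-up sample.

```python
import re

# Hypothetical rule-based sentence boundary detection: a period is a boundary
# only when it is not part of a number, URL, email, or known abbreviation.
ABBREVIATIONS = {"TS.", "GS.", "Tr."}  # assumed sample, not from the thesis

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith(("!", "?")):               # unambiguous delimiters
            sentences.append(" ".join(current)); current = []
        elif tok.endswith("."):
            if tok in ABBREVIATIONS:               # case (2): abbreviation
                continue
            if re.search(r"\d\.\d|@|://", tok):    # case (3): number/email/URL
                continue
            sentences.append(" ".join(current)); current = []  # case (1)
    if current:
        sentences.append(" ".join(current))
    return sentences
```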




3.2.2. Sentence Tokenization

Sentence tokenization is the process of detaching punctuation marks from words in a sentence. For example, we would like to detach “,” from the word preceding it.
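A minimal sketch of this detachment step; the exact punctuation set handled by the thesis's tokenizer is an assumption here.

```python
import re

# Separate punctuation such as , . ! ? : ; from the word it is attached to,
# so that "hôm nay," becomes the two tokens "hôm nay" and ",".
def tokenize_sentence(sentence):
    return re.sub(r'([,.!?:;])', r' \1 ', sentence).split()
```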

3.2.3. Word Segmentation

As mentioned in Section 3.1, Vietnamese words are not always delimited by whitespace, because a word can contain more than one syllable. This gives rise to the task of word segmentation, i.e., segmenting a sentence into a sequence of words. Vietnamese word segmentation is a prerequisite for any further processing and text mining. Though quite basic, it is not a trivial task because of the following ambiguities:

- Overlapping ambiguity: a string αβγ is said to have overlapping ambiguity when both αβ and βγ are valid Vietnamese words. For example, in “học sinh học sinh học” (The student studies biology), both “học sinh” (student) and “sinh học” (biology) are found in the Vietnamese dictionary.

- Combination ambiguity: a string αβ is said to have combination ambiguity when α, β, and αβ are all possible choices. For instance, in “bàn là một dụng cụ” (The table is a tool), “bàn” (table), “bàn là” (iron), and “là” (is) are all found in the Vietnamese dictionary.



In this work, we used the Conditional Random Fields approach to segment Vietnamese words [31]. The output of this step is sequences of syllables joined to form words.
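To illustrate why dictionary lookup alone cannot resolve the ambiguities above, here is a toy longest-matching segmenter over a made-up mini dictionary. The thesis itself uses a CRF model [31], not this heuristic.

```python
# Toy dictionary for illustration only.
DICTIONARY = {"học sinh", "sinh học", "học", "sinh"}

def longest_match(syllables):
    """Greedy longest-matching segmentation over a syllable list."""
    words, i = [], 0
    while i < len(syllables):
        for j in range(len(syllables), i, -1):      # try longest span first
            candidate = " ".join(syllables[i:j])
            if candidate in DICTIONARY or j == i + 1:
                words.append(candidate)
                i = j
                break
    return words

# On "học sinh học sinh học", greedy matching happens to yield the intended
# reading ["học sinh", "học sinh", "học"], but pure dictionary lookup cannot
# justify this choice over ["học", "sinh học", "sinh học"]; a CRF uses
# context to disambiguate.
```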

3.2.4. Filters

After word segmentation, tokens are separated by white space. Filters remove tokens that are trivial for the analysis process, i.e., number and date/time tokens and too-short tokens (shorter than 2 characters). Sentences that are too short, English sentences, and Vietnamese sentences written without tones (Vietnamese is sometimes written without tone marks) should also be filtered or handled in this phase.
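The token-level filtering rules above can be sketched as follows. The 2-character threshold mirrors the text; the number/date pattern is an assumption.

```python
import re

# Drop tokens that are trivial for the analysis: numbers, date/time strings,
# and tokens shorter than 2 characters.
def filter_tokens(tokens):
    kept = []
    for tok in tokens:
        if len(tok) < 2:
            continue                          # too-short token
        if re.fullmatch(r"[\d./:-]+", tok):
            continue                          # number or date/time pattern
        kept.append(tok)
    return kept
```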

3.2.5. Remove Non Topic-Oriented Words

Non-topic-oriented words are those we consider trivial for the topic analysis process. These words can introduce much noise and negatively affect our analysis. Here, we treat functional words and too-rare or too-common words as non-topic-oriented words. See the following table for more details about functional words in Vietnamese:




Table 3.5. Functional words in Vietnamese



Part of Speech (POS)        Examples
Classifier noun             cái, chiếc, con, bài, câu, cây, tờ, lá, việc
Major/minor conjunction     bởi chưng, bởi vậy, chẳng những, …
Combination conjunction     cho, cho nên, cơ mà, cùng, dẫu, dù, và
Introductory word           gì, hẳn, hết, …
Numeral                     nhiều, vô số, một, một số, …
Pronoun                     anh ấy, cô ấy, …
Adjunct                     sẽ, sắp sửa, suýt, …
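A hedged sketch of the removal step: a tiny stop-word sample drawn from Table 3.5 plus document-frequency cut-offs for too-rare and too-common words. The thresholds 2 and 0.5 are illustrative assumptions, not values from the thesis.

```python
from collections import Counter

# Tiny sample of functional words (see Table 3.5); a real list is much larger.
STOP_WORDS = {"và", "là", "cho", "gì", "nhiều"}

def remove_non_topic_words(docs, min_df=2, max_df_ratio=0.5):
    """docs: list of documents, each a list of word tokens."""
    df = Counter()                      # document frequency of each word
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    def keep(w):
        return (w not in STOP_WORDS
                and df[w] >= min_df               # not too rare
                and df[w] / n <= max_df_ratio)    # not too common
    return [[w for w in doc if keep(w)] for doc in docs]
```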



3.3. Topic Analysis for VnExpress Dataset

We collected a large dataset from VnExpress [47] using Nutch [36] and then performed preprocessing and transformation. Statistics of the topics assigned by humans and other parameters of the dataset are shown in the tables below:

Table 3.6. Statistics of topics assigned by humans in VnExpress Dataset



Society: Education, Entrance Exams, Life of Youths, …
International: Analysis, Files, Lifestyle, …
Business: Businessmen, Stock, Integration, …
Culture: Music, Fashion, Stage – Cinema, …
Sport: Football, Tennis, …
Life: Family, Health, …
Science: New Techniques, Natural Life, Psychology, …
and others …



Note that information about topics assigned by humans is listed here only for reference and is not used in the topic analysis process. After data preprocessing and transformation, we obtain 53M of data (40,268 documents, 257,533 words; vocabulary size of 128,768). This data is put into GibbsLDA++ [38] – a tool for Latent Dirichlet Allocation using Gibbs sampling (see Section 1.3). The results of topic analysis with K = 100 topics are shown in Table 3.8.

Table 3.7. Statistics of VnExpress dataset



After removing HTML and performing sentence and word segmentation:
    size ≈ 219M; number of docs = 40,328
After filtering and removing non-topic-oriented words:
    size ≈ 53M; number of docs = 40,268
    number of words = 5,512,251; vocabulary size = 128,768
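For reference, GibbsLDA++ reads a plain-text file whose first line is the number of documents, followed by one document per line as space-separated words. A minimal formatter sketch is below; the file name in the comment is hypothetical, and the exact command line should be checked against the tool's documentation.

```python
# Build the GibbsLDA++ input format: first line = number of documents,
# then one document per line as space-separated word tokens.
def format_gibbslda_input(docs):
    """docs: list of documents, each a list of word tokens."""
    lines = [str(len(docs))]
    lines += [" ".join(doc) for doc in docs]
    return "\n".join(lines) + "\n"

# Estimation is then run along the lines of (illustrative file name):
#   lda -est -ntopics 100 -dfile vnexpress.dat
```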

Table 3.8. Most likely words for sample topics. Here, we conduct topic analysis with 100 topics.

Topic 1
    Tòa (Court)                 0.0192
    Điều tra (Investigate)      0.0180
    Luật sư (Lawyer)            0.0162
    Tội (Crime)                 0.0142
    Tòa án (Court)              0.0108
    Kiện (Lawsuits)             0.0092
    Buộc tội (Accuse)           0.0076
    Xét xử (Judge)              0.0076
    Bị cáo (Accused)            0.0065
    Phán quyết (Sentence)       0.0060
    Bằng chứng (Evidence)       0.0046
    Thẩm phán (Judge)           0.0050

Topic 3
    Game                        0.0542
    Trò chơi (Game)             0.0314
    Người chơi (Gamer)          0.0276
    Nhân vật (Characters)       0.0239
    Online                      0.0117
    Giải trí (Entertainment)    0.0097
    Trực tuyến (Online)         0.0075
    Phát hành (Release)         0.0050
    Điều khiển (Control)        0.0044
    Nhiệm vụ (Mission)          0.0044
    Chiến đấu (Fight)           0.0039
    Phiên bản (Version)         0.0034

Topic 7
    Trường (School)             0.0660
    Lớp (Class)                 0.0562
    Học sinh (Pupil)            0.0471
    Giáo dục (Education)        0.0192
    Dạy (Teach)                 0.0183
    Giáo viên (Teacher)         0.0179
    Môn (Subject)               0.0080
    Tiểu học (Primary school)   0.0070
    Hiệu trưởng (Rector)        0.0067
    Trung học (High school)     0.0064
    Tốt nghiệp (Graduation)     0.0063
    Năm học (Academic year)     0.0062

Topic 9
    Du lịch (Tourism)           0.0482
    Khách (Passengers)          0.0407
    Khách sạn (Hotel)           0.0326
    Du khách (Tourists)         0.0305
    Tour                        0.0254
    Tham quan (Visit)           0.0249
    Biển (Sea)                  0.0229
    Chuyến đi (Journey)         0.0108
    Giải trí (Entertainment)    0.0105
    Khám phá (Discovery)        0.0092
    Lữ hành (Travel)            0.0089
    Điểm đến (Destination)      0.0051

Topic 14
    Thời trang (Fashion)        0.0869
    Người mẫu (Models)          0.0386
    Mặc (Wear)                  0.0211
    Mẫu (Sample)                0.0118
    Trang phục (Clothing)       0.0082
    Đẹp (Nice)                  0.0074
    Thiết kế (Design)           0.0063
    Sưu tập (Collection)        0.0055
    Váy (Skirt)                 0.0052
    Quần áo (Clothes)           0.0041
    Phong cách (Styles)         0.0038
    Trình diễn (Perform)        0.0038

Topic 15
    Bóng đá (Football)          0.0285
    Đội (Team)                  0.0273
    Cầu thủ (Football players)  0.0241
    HLV (Coach)                 0.0201
    Thi đấu (Compete)           0.0197
    Thể thao (Sports)           0.0176
    Đội tuyển (Team)            0.0139
    CLB (Club)                  0.0138
    Vô địch (Championship)      0.0089
    Mùa (Season)                0.0063
    Liên đoàn (Federation)      0.0056
    Tập huấn (Training)         0.0042

3.4. Topic Analysis for Vietnamese Wikipedia Dataset

The second dataset is collected from Vietnamese Wikipedia and contains D = 29,043 documents. We preprocessed this dataset in the same way as described in Section 3.2. This led to a vocabulary size of V = 63,150 and a total of 4,784,036 word tokens. In the hidden topic mining phase, the number of topics K was fixed at 200. The hyperparameters α and β were set to 0.25 and 0.1, respectively.

Table 3.9. Statistics of the Vietnamese Wikipedia dataset

After removing HTML and performing sentence and word segmentation:
    size ≈ 270M; number of docs = 29,043
After filtering and removing non-topic-oriented words:
    size ≈ 48M; number of docs = 17,428
    number of words = 4,784,036; vocabulary size = 63,150



Table 3.10. Most likely words for sample topics. Here, we conduct topic analysis with 200 topics.

Topic 2
    Tàu (Ship)                  0.0527
    Hải quân (Navy)             0.0437
    Hạm đội (Fleet)             0.0201
    Thuyền (Boat)               0.0100
    Đô đốc (Admiral)            0.0097
    Tàu chiến (Warship)         0.0092
    Cảng (Harbour)              0.0086
    Tấn công (Attack)           0.0081
    Lục chiến (Marine)          0.0075
    Thủy quân (Seaman)          0.0067
    Căn cứ (Army base)          0.0066
    Chiến hạm (Gunboat)         0.0058

Topic 5
    Độc lập (Independence)      0.0095
    Lãnh đạo (Lead)             0.0088
    Tổng thống (President)      0.0084
    Đất nước (Country)          0.0070
    Quyền lực (Power)           0.0069
    Dân chủ (Democratic)        0.0068
    Chính quyền (Government)    0.0067
    Ủng hộ (Support)            0.0065
    Chế độ (Regime)             0.0063
    Kiểm soát (Control)         0.0058
    Lãnh thổ (Territory)        0.0058
    Liên bang (Federal)         0.0051

Topic 6
    Động vật (Animal)           0.0220
    Chim (Bird)                 0.0146
    Lớp (Class)                 0.0123
    Cá sấu (Crocodiles)         0.0116
    Côn trùng (Insect)          0.0113
    Trứng (Eggs)                0.0093
    Cánh (Wing)                 0.0092
    Vây (Fin)                   0.0077
    Xương (Bone)                0.0075
    Phân loại (Classify)        0.0054
    Môi trường (Environment)    0.0049
    Xương sống (Spine)          0.0049

Topic 8
    Nguyên tố (Element)         0.0383
    Nguyên tử (Atom)            0.0174
    Hợp chất (Compound)         0.0172
    Hóa học (Chemical)          0.0154
    Đồng vị (Isotope)           0.0149
    Kim loại (Metal)            0.0148
    Hidro (Hydrogen)            0.0142
    Phản ứng (Reaction)         0.0123
    Phóng xạ (Radioactivity)    0.0092
    Tuần hoàn (Periodic)        0.0086
    Hạt nhân (Nuclear)          0.0078
    Điện tử (Electron)          0.0076

Topic 9
    Trang (Page)                0.0490
    Web (Web)                   0.0189
    Google (Google)             0.0143
    Thông tin (Information)     0.0113
    Quảng cáo (Advertisement)   0.0065
    Người dùng (User)           0.0058
    Yahoo (Yahoo)               0.0054
    Internet (Internet)         0.0051
    Cơ sở dữ liệu (Database)    0.0044
    RSS (RSS)                   0.0041
    HTML (HTML)                 0.0039
    Dữ liệu (Data)              0.0038

Topic 17
    Lực (Force)                 0.0487
    Chuyển động (Move)          0.0323
    Định luật (Law)             0.0289
    Khối lượng (Mass)           0.0203
    Quy chiếu (Reference)       0.0180
    Vận tốc (Velocity)          0.0179
    Quán tính (Inertia)         0.0173
    Vật thể (Object)            0.0165
    Newton (Newton)             0.0150
    Cơ học (Mechanics)          0.0149
    Hấp dẫn (Gravitation)       0.0121
    Tác động (Influence)        0.0114



3.5. Discussion

The hidden topic analysis using LDA for both the VnExpress and Vietnamese Wikipedia datasets has shown satisfactory results. While the VnExpress dataset is more suitable for analyzing daily-life topics, the Vietnamese Wikipedia dataset is good for scientific topic modeling. Which one is suitable for a given task depends largely on the task's domain of application.

From the experiments, it can be seen that the number of topics should be appropriate to the nature of the dataset and the domain of application. If we choose a large number of topics, the analysis process can generate many topics that are semantically too close to each other. On the other hand, if we assign a small number of topics, the resulting topics can be too general, and the learning process then benefits less from this topic information.

When conducting topic analysis, one should consider the data very carefully. Preprocessing and transformation are important steps because noise words can have negative effects. In Vietnamese, the focus should be on word segmentation and stop-word filtering. Also, common Vietnamese personal names should be removed. In other cases, it is necessary either to remove all Vietnamese sentences written without tones (this writing style is quite common in online Vietnamese data) or to perform tone recovery on them. Attention should also be paid to language identification, encoding conversion, etc., due to the complex variety of online data.



3.6. Summary

This chapter summarized the major issues in topic analysis of two specific Vietnamese datasets. We first reviewed some characteristics of Vietnamese. These considerations are significant for dataset preprocessing and transformation in the subsequent processes. We then described each step of preprocessing and transforming the data, highlighting significant notes, including specific characteristics of Vietnamese. In the last part, we demonstrated the results of topic analysis using LDA for the two Vietnamese datasets. The results showed that LDA is a promising method for topic analysis in Vietnamese.






Chapter 4. Deployments of General Frameworks

This chapter goes into further detail on the deployments of the general frameworks for two tasks: classification and clustering of Vietnamese web search results. Evaluation and analysis of our proposals are also covered in the following subsections.



4.1. Classification with Hidden Topics

4.1.1. Classification Method



Figure 4.1. Classification with VnExpress topics



The objective of classification is to automatically categorize newly arriving documents into one of k classes. Given a moderate-sized training dataset, an estimated topic model, and k classes, we would like to build a classifier based on the framework in Figure 4.1. Here, we use the model estimated from the VnExpress dataset with LDA (see Section 3.3 for more details). In the following subsections, we discuss the important issues of this deployment.

a. Data Description

For training and testing data, we first submit queries to Google and collect the results through the Google API [19]. The numbers of query phrases and snippets in the training and testing datasets are shown in Table 4.1. The search phrases for the training and testing data are designed to be as exclusive as possible.

b. Combining Data with Hidden Topics

The outputs of topic inference for training/new data are topic distributions, each of which corresponds to one snippet. We then combine each snippet with its hidden topics. This can be done by a simple procedure in which the occurrence frequency of a topic in the combination depends on its probability. For example, a topic with probability greater than 0.03 and less than 0.05 gets 2 occurrences, while a topic with probability less than 0.01 is not included in the combination. An example is shown in Figure 4.2.
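The bucketing procedure can be sketched as follows. The boundaries 0.01, 0.03, and 0.05 come from the example in the text; the pseudo-token naming and the behavior outside those ranges are assumptions.

```python
# Map each inferred topic probability to a small integer count and append
# pseudo-tokens "topic:k" to the snippet's word list.
def combine_with_topics(snippet_words, topic_dist):
    combined = list(snippet_words)
    for k, p in enumerate(topic_dist):
        if p < 0.01:
            occurrences = 0      # too unlikely: not included
        elif p < 0.03:
            occurrences = 1
        elif p < 0.05:
            occurrences = 2      # as in the text's example
        else:
            occurrences = 3      # assumed scale for higher probabilities
        combined.extend([f"topic:{k}"] * occurrences)
    return combined
```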

Table 4.1. Google search results as training and testing datasets. The search phrases for the training and testing data are designed to be exclusive.



Domains              Training dataset        Testing dataset
                     #phrases  #snippets     #phrases  #snippets
Business                 50      1,479            9       270
Culture-Arts             49      1,350           10       285
Health                   45      1,311            8       240
Laws                     52      1,558           10       300
Politics                 32        957            9       270
Science-Education        41      1,229            9       259
Life-Society             19        552            8       240
Sports                   45      1,267            9       223
Technologies             51      1,482            9       270

c. Maximum Entropy Classifier

The motivating idea behind maximum entropy [34][35] is that one should prefer the most uniform model that also satisfies any given constraints. For example, consider a four-class text classification task where we are told only that, on average, 40% of the documents containing the word “professor” are in the faculty class. Intuitively, when given a document with “professor” in it, we would say it has a 40% chance of being a faculty document and a 20% chance of being in each of the other three classes. If a document does not contain “professor”, we would guess the uniform class distribution, 25% each. This model is exactly the maximum entropy model that conforms to our known constraint.

Although maximum entropy can be used to estimate any probability distribution, we consider here only the classification task; thus we limit the problem to learning conditional distributions from labeled training data. Specifically, we would like to learn the conditional distribution of the class label given a document.
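The “professor” example can be verified numerically. With a single binary feature f(d, c) = 1 iff the document contains “professor” and c = faculty, the maximum entropy model has the exponential form p(c|d) ∝ exp(λ·f(d, c)), and choosing λ = ln 2 reproduces the 40%/20% split. The class names below are made up for illustration.

```python
import math

# Hypothetical four classes; "faculty" is the constrained one.
CLASSES = ["faculty", "student", "course", "project"]
LAMBDA = math.log(2)  # e^lambda = 2 makes p(faculty | "professor") = 2/5 = 0.4

def p_class_given_doc(doc_words, c):
    """Conditional maxent model with one binary feature."""
    scores = {
        cl: math.exp(LAMBDA * int("professor" in doc_words and cl == "faculty"))
        for cl in CLASSES
    }
    z = sum(scores.values())   # normalizer over classes
    return scores[c] / z
```

With “professor” present, the faculty score is e^λ = 2 against 1 for each other class, giving 2/5 = 0.4 and 1/5 = 0.2; without it, all scores are 1 and the distribution is uniform at 0.25.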


