Clustering the Tagged Web

Abstract

Automatically clustering web pages into semantic groups promises improved search and browsing on the web. In this paper, we demonstrate how user-generated tags from large-scale social bookmarking websites such as del.icio.us can be used as a complementary data source to page text and anchor text for improving automatic clustering of web pages. We explore the use of tags in 1) K-means clustering in an extended vector space model that includes tags as well as page text and 2) a novel generative clustering algorithm based on latent Dirichlet allocation that jointly models text and tags. We evaluate the models by comparing their output to an established web directory. We find that the naive inclusion of tagging data improves cluster quality versus page text alone, but a more principled inclusion can substantially improve the quality of all models, with a statistically significant absolute F-score increase of 4%. The generative model outperforms K-means with another 8% F-score increase.
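The abstract does not spell out how the extended vector space is built, so the following is only a minimal sketch under stated assumptions: each page gets one count vector over its words and one over its tags, each vector is L2-normalized separately, and the two are concatenated before running a plain K-means. The toy pages, the function names, and the choice of separate normalization are illustrative assumptions, not the authors' published method.

```python
import math
from collections import Counter

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def count_vector(tokens, vocab):
    """Raw term counts of `tokens`, laid out in the order given by `vocab`."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]

def combined_vector(words, tags, word_vocab, tag_vocab):
    """Assumed weighting: normalize words and tags separately, then concatenate,
    so that neither source dominates just because it has more tokens."""
    return (l2_normalize(count_vector(words, word_vocab))
            + l2_normalize(count_vector(tags, tag_vocab)))

def kmeans(points, k, iters=20):
    """Plain K-means with squared Euclidean distance.
    Centroids start at the first k points so the run is deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy pages: (page words, user tags).
pages = [
    ("python code tutorial".split(), "programming python".split()),
    ("pasta recipe dinner".split(), "cooking food".split()),
    ("code compiler language".split(), "programming".split()),
    ("recipe soup dinner".split(), "cooking".split()),
]
word_vocab = sorted({w for words, _ in pages for w in words})
tag_vocab = sorted({t for _, tags in pages for t in tags})
vectors = [combined_vector(w, t, word_vocab, tag_vocab) for w, t in pages]
labels = kmeans(vectors, k=2)
print(labels)
```

On this toy input the two programming pages end up in one cluster and the two cooking pages in the other. A naive alternative would simply append tags to the page text as extra words; the separate normalization shown here is one way to keep a handful of tags from being drowned out by much longer page text.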


Similar books and articles

Probabilistic Web Image Gathering. Yanai Keiji - 2007 - Transactions of the Japanese Society for Artificial Intelligence 22 (1):10-18.
Image Clustering Using Web Text for Image Retrieval. Nagata Akiko Sunayama Wataru - 2004 - Transactions of the Japanese Society for Artificial Intelligence 19:580-588.

Analytics

Added to PP
2010-12-22
