This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
INVITED PAPER
Contextual Internet
Multimedia Advertising
By Tao Mei, Member IEEE, and Xian-Sheng Hua, Member IEEE
ABSTRACT | The advent of media-sharing sites has led to the
unprecedented Internet delivery of community-contributed
media like images and videos. Those visual contents have
become the primary sources for online advertising. Conven-
tional advertising treats multimedia advertising as general text
advertising by displaying advertisements either relevant to the
queries or the Web pages, without considering the potential
advantages which could be brought by media contents. In this
paper, we summarize the trend of Internet multimedia
advertising and conduct a broad survey on the methodologies
for advertising which are driven by the rich contents of images
and videos. We discuss three key problems in a generic
multimedia advertising framework. These problems are: con-
textual relevance that determines the selection of relevant
advertisements, contextual intrusiveness which is the key to
detect appropriate ad insertion positions within an image or
video, and insertion optimization that achieves the best
association between the advertisements and insertion posi-
tions so that the effectiveness of advertising can be maximized
in terms of both contextual relevance and contextual intrusiveness.
We present the recently developed MediaSense, which
consists of image, video, and game advertising, as an exemplary
application of contextual multimedia advertising. In
MediaSense, the most contextually relevant ads are embedded
at the most appropriate positions within images or videos. To
this end, techniques in computer vision, multimedia retrieval,
and human-computer interaction are leveraged. We also
envision that the next trend of multimedia advertising will
be game-like advertising, which is more impressive and
thus can promote advertising in an interactive, as well as more
compelling and effective, way. We conclude this survey with a
brief outlook on open research directions.
KEYWORDS | Computer vision; contextual advertising; multimedia
advertising; survey
Manuscript received April 5, 2009.
The authors are with Microsoft Research Asia, Beijing 100190, China
(e-mail: {tmei, xshua}@microsoft.com).
Digital Object Identifier: 10.1109/JPROC.2009.2039841
I. INTRODUCTION
The proliferation of digital capture devices and the
explosive growth of online social media (especially along
with the so-called Web 2.0 wave) have led to countless
private image and video collections on local computing
devices, such as personal computers, cell phones, and
personal digital assistants (PDAs), as well as huge and
ever-increasing public media collections on the Internet [8].
For example, the number of digital images captured
worldwide is expected to increase from 50 billion in 2007
to 60 billion in 2011, according to IT Facts' report [44]. The most
popular photo sharing site, Flickr, reached three
billion photo uploads at the end of 2008, with 3–5 million
new photos uploaded daily [53], [85], while YouTube drew
5 billion U.S. online video views in July 2008 [118]. On the
other hand, we have witnessed a fast and consistently
growing online advertising market in recent years.
Jupiter Research forecasted that online advertising spending
will surge to $18.9 billion by 2010, up about
59 percent from an estimated $11.9 billion in 2005 [50].
Motivated by the huge business opportunities in the online
advertising market, people are now actively investigating
new Internet advertising models. To take advantage of
the visual form of information representation, multimedia
advertising, which associates advertisements with an
online image or video, has become an emerging online
monetization strategy.1
1 Please note that "multimedia advertising" and "advertising
multimedia" are two different concepts. By multimedia advertising, we refer to
the process of associating advertisements with multimedia, while advertising
multimedia indicates using multimedia as the form of advertisement.
0018-9219/$26.00 © 2010 IEEE
Vol. 0, No. 0, 2010 | Proceedings of the IEEE 1
Authorized licensed use limited to: MICROSOFT. Downloaded on June 29,2010 at 10:30:21 UTC from IEEE Xplore. Restrictions apply.
Mei and Hua: Contextual Internet Multimedia Advertising
In an advertising system, there is usually an interme-
diary commercial ad-network entity (i.e., a service
provider between the publisher and advertiser) in charge
of optimizing ad selection and display with the twin
goals of increasing revenue (shared between the publisher
and ad-network) and improving user experience.
With these goals, it is preferable to the publishers and
profitable to the advertisers to have ads relevant to media
content rather than generic ads. By implementing a solid
multimedia advertising strategy into an existing content
delivery chain, both the publishers and advertisers have
the ability to deliver compelling content, reach a growing
online audience, and eventually generate additional reve-
nue from online media.
Advertising has embarked on a dramatic evolution,
which will be rapid, fundamental, and permanent. Al-
though this evolution is still underway in advertising in
terms of objectives, strategy, and solutions, we can sum-
marize the trends of Internet advertising into two genera-
tions in terms of methodologies: conventional advertising
and contextual advertising. Fig. 1 shows the evolution of
advertising using text and media (such as image, video, and
audio) as information carriers for advertising, respectively.
The conventional text-based advertising, i.e., the first generation
in the leftmost part of Fig. 1, is characterized by delivering
ads at certain positions on Web pages which are
relevant to either the queries or Web page content. In this
generation, paid search and display advertising are the
main strategies, which support a "long-tail" and a "head"
business model, respectively. For example, Google's AdWords
[2] and AdSense [1] are successful paid search advertising
platforms, while DoubleClick [24] and Yahoo!
[112] have predominantly focused on the latter. Research in
the first generation has proceeded along three dimensions
from the perspective of what the ads are matched against:
1) keyword-targeted advertising (also called "paid search
advertising" or "sponsored search"), in which the ads are
matched against the originating queries [49], [75], [106];
2) content-targeted advertising (also called "contextual
advertising")2, in which the ads are associated with the
Web page content rather than the keywords [4], [11], [54],
[83], [89]; and 3) user-targeted advertising (also called
"audience intelligence"), in which the ads are driven by
user profile and demography [37], comments [6], or
behavior [14], [16], [22], [90]. The advertisements in the
first generation are typically embedded at certain reserved
blocks in the Web pages. While conventional
advertising primarily embeds ads around the content, the
second generation of text advertising, contextual advertising,
aims to deliver relevant ads inside the content. For
example, Vibrant Media [100] associates relevant ads with
certain keywords or paragraphs within a Web page and
embeds these ads in-text.
Fig. 1. Trend of Internet advertising in the text and media domains. Online advertising can be summarized as
two generations for each information carrier: the first generation is conventional advertising, which
embeds relevant ads at fixed positions, while the second is contextual advertising, which embeds ads at automatically detected
positions within the page and media.
Using text-based advertising as a reference, we can likewise
divide online advertising that uses media as the carrier into
two generations, conventional and contextual advertising
[40], as shown in the rightmost part of Fig. 1. Similar to text,
the first generation of multimedia advertising directly
applies text-based advertising approaches to media and
embeds relevant ads at certain reserved positions on the
Web pages. For example, Yahoo! [112] and BritePic [10]
provide relevant ads around images. In the video domain,
Revver [88] and YouTube [118], which subscribe to the advertising
service of Google's AdSense [1], employ pre-roll or post-roll
advertising (i.e., embedding ads or related videos at the
very beginning or end of videos), or overlay textual ads
on certain video frames (e.g., on the bottom fifth of
videos 15 seconds in).
It is observed that the first generation of multimedia
advertising primarily uses text rather than visual content to
match relevant ads. In other words, multimedia advertis-
ing has been treated as general text advertising without
considering the potential advantages which could be
brought by media contents. There are very few systems
in this generation to automatically monetize the opportu-
nities brought by individual images and videos. As a result,
the ads are only generally relevant to the entire Web page
containing images or videos rather than specific to the
images or videos it contains. Moreover, the ads are embedded
at a predefined position in a Web page adjacent to
the image or video, which normally disrupts the visually
appealing appearance and structure of the original Web
page. Such placement cannot capture and monetize the users' attention
aroused by these compelling contents.
2 Please note that here "contextual relevance" mainly indicates that the
relevance is derived from the entire Web content. In a broader view,
contextual relevance includes not only the relevance, but also the position
where the advertisements are inserted in a Web page.
Treating multimedia advertising as general text advertising
has proved unsuitable. The following distinctions
between media (i.e., image and video) and text
advertising motivate a new advertising generation dedicated
to media.
• Beyond the traditional medium of Web pages, images
and videos can be powerful and effective carriers of
online advertising. Compared with text, images and
videos have some unique advantages which have
made them the most pervasive
media formats on the Internet: they are more attractive
and more salient than plain text, and thus can grab
users' attention instantly [29]; and they carry more
information that can be comprehended more
quickly, as in the old saying, "a picture is worth
a thousand words." Media like images and videos
are now used almost as much as text in Web pages
and have become powerful information carriers for
online advertising. This enables a new advertising
model using media as the carriers for advertising,
in which ads can leave a much deeper impression
due to the salience of visual signals in human
perception.
• The ads are expected to be locally relevant to media
content and the surrounding text, rather than
globally relevant to the entire Web page. Compelling
media content naturally becomes the
region of interest (ROI) in a Web page. The most
effective way to advertise would be to place ad information
precisely relevant to the media content,
in the hope that audiences who are interested in this image
or video would have similar interests in the relevant
product or service advertised in it. It is likely
that the text in a Web page is either too much (e.g.,
using whole page text), or too little and/or too noisy
(e.g., image and video sharing sites), to accurately
describe an embedded image or video. On the one
hand, ads picked only based on the whole page
content may not be contextually relevant enough
to the image or video in that page. On the other,
conventional ad-networks like AdSense [1] and
AdWords [2] cannot work well for very sparse or
noisy textual contexts. Therefore, it is reasonable
to assume that the media content and its surrounding
text should contribute much more
to the relevance matching than the whole
Web page.
• The ads are expected to be dynamically embedded at
the appropriate positions within each individual
image or video (i.e., in-media) rather than at a predefined
position in the Web page. In conventional
advertising, publishers have to reserve certain
predefined blocks (please refer to Fig. 1) in the
Web page for advertising, whether banners or other
forms of ads. Such an advertising strategy has proved
intrusive to Internet users [74], as the ad blocks
significantly break the page structure and
visual appearance, and are unattractive
or boring to users. Now that users' attention is on
the media, embedding the ads within an image
or video will in turn bring the ads more attention.
Meanwhile, the publisher no longer needs to
worry about the reserved blocks. Putting ads
only in the non-salient portions of an image or
video reduces the intrusiveness of display ads
and improves the user experience at the
same time.
Fig. 2. Screenshots of ImageSense, VideoSense, and GameSense. The highlighted areas in (a) and (b) correspond to the ads,
while the missing block in (c) corresponds to the ad position. The ads are inserted into non-salient spatial positions within images or temporal
positions in video streams via visual saliency analysis. The ads are relevant to both the visual content and the Web page rather than only
to the entire Web page. (a) ImageSense [78]; (b) VideoSense [79], [80]; (c) GameSense [59], [60].
Motivated by the above observations, we go one step
further from the first generation of multimedia advertising
and propose in this paper the second generation, which
supports contextual multimedia advertising by associating
the most relevant ads with an online medium (image or
video) and seamlessly embedding the ads at the most
appropriate positions within this medium. As shown in the
rightmost part of Fig. 1, the ads are selected according to
multimodal relevance, i.e., the ads are to be globally
relevant to the Web page containing images or videos, as
well as locally relevant to the content and surrounding text
of each suitable medium. Meanwhile, the ads are embedded
at the most non-salient positions within the medium. By
leveraging computer vision and multimedia retrieval
techniques, we are better positioned to address two
challenges in Internet multimedia advertising, i.e., ad
relevance and ad position. We demonstrate MediaSense,
which includes ImageSense [78] and VideoSense [79], [80]
as two exemplary applications in the new generation,
dedicated to image and video, respectively. We also envision
that the next trend of multimedia advertising will be
game-like advertising, which would be more impressive
and thus can promote the advertisements in an
interactive way. We show GameSense [59], [60] as an
example of the next advertising platform. Fig. 2 shows
screenshots of ImageSense, VideoSense, and GameSense. It
is also worth noting that many metrics have been adopted
to evaluate the performance of a multimedia advertising
system. We review performance evaluation in different
domains.
The rest of the paper is organized as follows. Section II
provides a system overview of contextual multimedia
advertising, as well as the key problems. Sections III–V
address how we can leverage computer vision and multimedia
retrieval techniques to solve these problems in
detail. Section VI shows the implementations of
ImageSense, VideoSense, and GameSense. Section VII
discusses how to evaluate the performance of multimedia
advertising systems. Section VIII concludes this paper
and gives an outlook on future challenges.
II. SYSTEM AND KEY PROBLEMS
A. Terminology
To clearly present the system framework of the two
generations of multimedia advertising, we will adopt a
standard vocabulary to describe many of the common
aspects and terms across each of the exemplary systems.
• Advertisement (ad): An advertisement is a public
notice or announcement for calling something to
the attention of the public, especially by paid
announcements. In multimedia advertising, advertisements
take a variety of forms, including text
banner, image, video (i.e., traditional TV commercial),
animation, or a combination of forms.
Advertising is a form of communication that
typically attempts to persuade potential customers
to purchase or to consume more of a particular
brand of product or service. In this paper, we
mainly discuss two types of ads, i.e., image and
video ads.
• Image ad: An image advertisement is a static
image or banner provided by advertisers that will
be inserted into or associated with a source image
or video. An image ad could be a product logo [59],
[60], [66], [69], [78] or a banner composed of a
product logo, product name, description, and link
[31], [76], [101], [118].
• Video ad: A video ad is a video clip about a product.
Although video ads will be associated with a
video in MediaSense [25], [80], they may be in
different forms (or a combination of forms), including
typical commercials in TV programs [19],
[25], [39], [67], as well as a clip composed of text,
animations, or images.
• Source media: Source media are most often
produced or owned by content providers (or
partially by publishers who help to distribute the
contents), which may be images, videos, or audio
clips, captured and provided by professional
photographers, videographers, or grassroots users. The
advertisements will be embedded at certain
positions around, within, or overlaid on the source
media.
• Ad insertion point: A point/position where one or
more advertisements will be associated. An ad insertion
point could be a spatial region around the
medium in a Web page, a region on an image, a
spot on the timeline of a video, a spatiotemporal
patch in a video, or even a position outside the Web
page. For example, the highlighted rectangle in
Fig. 2(a), the yellow spots on the timeline in
Fig. 2(b), and the center missing block in Fig. 2(c)
are ad insertion points.
• Contextual advertising: Contextual advertising
refers to the placement of commercial advertise-
ments within the content of a generic Web page
based on similarity between the content of the
target page and the ad description provided by the
advertiser [11], [83]. If the advertisements will be
associated with media, then this type of advertising
is contextual multimedia advertising.
• Multimodal relevance: A modality is defined as
any source of information about media contents
that can be leveraged algorithmically for analysis
[52]. In multimedia advertising applications, the
information can be decomposed into various modalities
which measure low-level aspects of
visual data (such as the colors and textures in an
image, the motions in a video sequence, as well as
the tempos in an audio stream), mid-level
visual concepts or object categories (such as people,
locations, and objects), as well as high-level textual
descriptions associated with the data (such as user-provided
tags on an image or video, transcripts
associated with a video stream, and automatically recognized
captions on a video frame). The relevance
can be measured by the similarity between two
media files in terms of certain types of modalities.
For example, there are textual, visual, and aural
relevance. Accordingly, the multimodal relevance
is a combination of the results from various
modalities between two media.
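As an illustration of the multimodal relevance defined above, the following minimal sketch combines per-modality similarities by a weighted average. The choice of cosine similarity, the linear fusion rule, the function names, the feature vectors, and the weights are all assumptions made for this example; the definition above only states that the results from several modalities are combined.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def multimodal_relevance(media, ad, weights):
    """Weighted average of per-modality cosine similarities.

    Only modalities present in both items and in `weights` contribute;
    the weights of the contributing modalities are renormalized.
    """
    shared = [m for m in weights if m in media and m in ad]
    total = sum(weights[m] for m in shared)
    if total == 0:
        return 0.0
    return sum(weights[m] * cosine(media[m], ad[m]) for m in shared) / total

# Hypothetical feature vectors: the aural modality is missing for both items.
media = {"textual": [1.0, 0.0, 1.0], "visual": [0.2, 0.8]}
ad = {"textual": [1.0, 0.0, 1.0], "visual": [0.8, 0.2]}
weights = {"textual": 0.5, "visual": 0.3, "aural": 0.2}
score = multimodal_relevance(media, ad, weights)
```

Renormalizing over the shared modalities keeps the score comparable between a silent image and a video with an audio track, which is one possible design choice among many.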
B. General Framework
Fig. 3 shows the general framework of multimedia
advertising. It also summarizes the distinctions between the
two generations of multimedia advertising, as mentioned
in Section I. Given an online medium, which could
be an image within a Web page, a collection of image
search results, a video sequence, or even an audio stream, a
list of candidate ads is selected from an ad inventory and
ranked according to the relevance between the source
medium and the ads. The multimodal relevance can be
derived from textual information, such as user-provided
tags, video transcripts, and automatically recognized captions;
from low-level similarity in
terms of color and textures in images, camera or object
motions in video sequences, or tempo and beat in audio
streams; and from semantic-level similarity in terms of automatically
detected visual concepts and object categories [27], [77],
[84], [95]. Meanwhile, a set of candidate ad insertion
points is automatically detected on the basis of spatiotemporal
visual saliency analysis [15], [64], [69], [73],
[78]–[80], [103]. Intuitively, the ads would be inserted
into the most non-salient positions within the media
contents so that the ads would be nonintrusive to the
viewers. Moreover, the ads associated with the media
would be the most relevant to the contents. By minimizing
contextual intrusiveness and maximizing contextual relevance
simultaneously, the effectiveness of advertising can
be maximized [74]. Given a list of candidate ads and ad
insertion points, an optimization-based association module
will associate each candidate ad with the best insertion
point.
Fig. 3. A general framework of multimedia advertising. Conventional advertising only focuses on text-based ad relevance matching,
while contextual advertising like MediaSense [40] considers not only textual relevance but also multimodal (textual, visual, and
aural) relevance, as well as automatic detection of appropriate ad insertion points within media.
We can see from Fig. 3 that conventional multimedia
advertising only focuses on ad relevance matching, which corresponds
to the "text-based ad matching" and "candidate ad
list" modules, while the new generation of advertising not
only considers "multimodal ad matching" for ad selection
but also investigates the problem of "ad insertion point
detection." Finally, given a set of candidate ads and ad
insertion points, the "optimization-based ad delivery"
module will associate the most relevant ads with the most
appropriate insertion points by maximizing the overall
relevance while minimizing the overall contextual
intrusiveness.
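The association step described above can be sketched as a small assignment problem. The example below is a toy illustration only: the linear objective, the trade-off weight `alpha`, the function name, and the scores are assumptions, and the exhaustive search is practical only for a handful of ads and insertion points. It is not the optimization procedure of any surveyed system.

```python
from itertools import permutations

def best_association(relevance, intrusiveness, alpha=0.5):
    """Exhaustively assign each ad to a distinct insertion point.

    relevance[i][j]: relevance of ad i at insertion point j (higher is better).
    intrusiveness[j]: intrusiveness of insertion point j (lower is better).
    alpha: hypothetical trade-off weight between the two terms.
    Returns (assignment, score), where assignment[i] is the point for ad i.
    """
    n_ads, n_points = len(relevance), len(intrusiveness)
    if n_ads > n_points:
        raise ValueError("need at least as many insertion points as ads")
    best, best_score = None, float("-inf")
    for points in permutations(range(n_points), n_ads):
        score = sum(
            alpha * relevance[i][j] - (1 - alpha) * intrusiveness[j]
            for i, j in enumerate(points)
        )
        if score > best_score:
            best, best_score = points, score
    return list(best), best_score

# Two candidate ads, three candidate insertion points (made-up scores).
relevance = [[0.9, 0.2, 0.4],
             [0.8, 0.7, 0.1]]
intrusiveness = [0.1, 0.3, 0.9]
assignment, score = best_association(relevance, intrusiveness)
```

With these numbers the first ad is placed at the first point and the second ad at the second point; the third point is left unused because it is the most intrusive one.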
C. Key Problems
In general, there are four key problems in an effective
contextual multimedia advertising system: contextual rele-
vance, contextual intrusiveness, insertion optimization, and
rich displaying.
• Contextual relevance: Which ads should be selected
for a given image or video? Since relevance
increases advertising revenue [57], [74], contextual
multimedia advertising performs multimodal relevance
matching by considering both global textual
relevance from the entire Web page and local
relevance from textual information associated with
the media content, as well as low-level visual and
high-level semantic similarity between the ads and
ad insertion points.
• Contextual intrusiveness: Where should the
selected ads be inserted so that the contextual
intrusiveness will be minimized? Ad position will
certainly affect user experience when an image or a
video is viewed [74]. In contextual multimedia
advertising, the selected ads are to be inserted into
the most non-intrusive positions within the media.
• Insertion optimization: Given a ranked list of
candidate ads and ad insertion points, how should
each ad be associated with the best ad insertion point?
The objective is to maximize the effectiveness of
advertising by simultaneously minimizing the
contextual intrusiveness to viewers and maximizing
the contextual relevance between the ads and
media.
• Rich displaying: How should the selected ads be
displayed or rendered? Rich displaying includes
the duration of each ad, the way the ad is
rendered, the support of interaction between the
ad and users, and the rich information associated with
ads. Effective displaying makes the advertising
less intrusive to users.
In this paper, we mainly focus on the first three problems
and leave the fourth as an open issue. The comparisons
between conventional and the proposed contextual
multimedia advertising in terms of contextual relevance
and contextual intrusiveness are listed in Table 1.
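To make the contextual intrusiveness problem concrete, the sketch below scans a precomputed saliency map for the rectangular window with the lowest total saliency, which would be a candidate spatial ad insertion point in an image. The saliency values, the window size, and the function name are hypothetical, and the brute-force scan is an assumption made for this example; how the saliency map itself is computed is outside its scope.

```python
def least_salient_window(saliency, win_h, win_w):
    """Return (row, col, total) of the win_h x win_w window with the
    lowest summed saliency, scanning every top-left corner."""
    rows, cols = len(saliency), len(saliency[0])
    if win_h > rows or win_w > cols:
        raise ValueError("window larger than saliency map")
    best = None
    for r in range(rows - win_h + 1):
        for c in range(cols - win_w + 1):
            total = sum(
                saliency[r + dr][c + dc]
                for dr in range(win_h)
                for dc in range(win_w)
            )
            # Strict comparison keeps the first window found on ties.
            if best is None or total < best[2]:
                best = (r, c, total)
    return best

# Made-up 3 x 4 saliency map with values in [0, 1].
saliency = [
    [0.9, 0.8, 0.2, 0.1],
    [0.7, 0.9, 0.1, 0.0],
    [0.3, 0.4, 0.5, 0.6],
]
row, col, total = least_salient_window(saliency, 2, 2)
```

Here the top-right 2 x 2 window is selected because it covers the least salient cells of the map.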
III. CONTEXTUAL RELEVANCE
One of the fundamental problems in contextual advertising
is "relevance": studies indicate that irrelevant ads detract from user
experience, whereas relevant ads increase the probability of reaction [57],
[74]. By contextual relevance, we refer to the fact that ads
are expected to be relevant both to the entire source media
and the local ad insertion points within the media. The
contextual relevance for each pair of ad and ad insertion
point is a multimodal relevance consisting of textual,
visual, conceptual, and user relevance. In this section, we
will review the relevance from different modalities.
A. Textual Relevance
The major effort in advertising has focused on the text domain.
There exists rich research in the literature on textual
relevance that can be leveraged or applied to multimedia
advertising. The literature review on text relevance in this
paper will focus on two key issues: 1) ad keyword selection,
i.e., how to pick suitable keywords, Web pages, or images
for advertising so that the relevance can be improved, and
2) ad relevance matching, i.e., how to select relevant ads
according to a set of selected keywords or a Web page.
Table 1 Comparisons Between Conventional and Contextual Multimedia Advertising, in Terms of Contextual Relevance and Contextual Intrusiveness
Typical advertising systems analyze a Web page or
query to find prominent keywords or categories, and then
match these keywords or categories against the words on
which advertisers bid. If there is a match, the corresponding
ads will be displayed to the user through the Web page.
Yih et al. have studied a learning-based approach to automatically
extracting appropriate keywords from Web pages
for advertisement targeting [117]. Instead of dealing with
general Web pages, Li et al. propose a sequential pattern
mining-based method to discover keywords from a specific
broadcasting content domain [58]. In addition to Web
pages, queries also play an important role in paid search
advertising. In [94], the queries are classified into an intermediate
taxonomy so that the selected ads are more
targeted to the query. The works in [56], [59], [60], [78]–
[80] present methods for detecting potentially suitable
images or videos in a Web page for advertising, by analyzing
the structure of the Web page and the visual appearance of
images or videos.
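The keyword-to-bid matching pipeline described at the start of this paragraph can be sketched in a few lines. Everything in the example is an assumption for illustration: keywords are picked by raw term frequency rather than by a learned extractor such as that of [117], each advertiser bids on a single word, and the page text, advertiser names, and bid prices are invented.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "on", "with"}

def prominent_keywords(text, top_k=2):
    """Return the top_k most frequent non-stopword tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

def match_ads(keywords, bids):
    """Return ads whose bid phrase is among the keywords, highest bid first."""
    wanted = set(keywords)
    hits = [(ad, price) for ad, (phrase, price) in bids.items() if phrase in wanted]
    return [ad for ad, _ in sorted(hits, key=lambda h: -h[1])]

# Hypothetical page text and ad inventory: advertiser -> (bid phrase, bid price).
page = "Mountain bike trails for every bike rider: bike maintenance and mountain gear."
bids = {
    "BikeShop": ("bike", 0.40),
    "TrailMaps": ("mountain", 0.55),
    "CarDealer": ("car", 0.90),
}
keywords = prominent_keywords(page)
ads = match_ads(keywords, bids)
```

The highest bidder overall is not shown because its bid phrase does not occur among the page keywords, which reflects the exact-match limitation discussed in this section.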
As mentioned in Section I, research on ad
relevance has proceeded along three dimensions from the
perspective of what the ads are matched against:
1) keyword-targeted advertising ("paid search advertising"
or "sponsored search"), 2) content-targeted advertising
("contextual advertising"), and 3) user-targeted advertising
("audience intelligence"). Although the paid search
market is developing faster than the contextual advertising market,
and most textual ads are still characterized by "bid
phrases," there has been a drift to contextual advertising as
it supports a long-tail business model [57]. For example, a
recent work examines a number of strategies to match ads
to Web pages based on extracted keywords [89]. A follow-up
work applies Genetic Programming (GP) to learn functions
that select the most appropriate ads, given the
contents of a Web page [54]. To alleviate the problem
of exact keyword match in conventional advertising,
Broder et al. propose to integrate semantic phrases into
traditional keyword matching [11]. Specifically, both the
pages and ads are classified into a common large taxonomy,
which is then used to narrow down the search of
keywords to concepts. Most recently, Hu et al. propose to
predict user demographics from browsing behavior [37].
The intuition is that while user demographics are not easy
to obtain, browsing behaviors indicate a user's interest and
profile.
When applying textual relevance to the visual domain, in
addition to the techniques discussed above for the text domain,
the characteristics of media should be taken into account
from the following perspectives: 1) The entire texts in the
Web page are too noisy and broad to describe the media
embedded in the page, while the surrounding texts which
are spatially close to the media can better describe the
contents and lead to better ad relevance. 2) The sur-
rounding texts associated with an image or video are
sometimes too few for selecting relevant ads. The hidden
texts (e.g., expanded words, visual concepts, object
categories, or events) which are automatically recognized
from visual signals can more precisely describe the media
contents. Using the surrounding and hidden texts can yield
better textual relevance.
Given a Web page containing images or videos, it is
desirable to first segment it into several blocks with
coherent topics, detect the blocks with suitable images or
videos for advertising, and extract the semantic structure
such as the surrounding texts from these blocks. The
Vision-based Page Segmentation (VIPS) algorithm [12],
[13] is adopted to extract the surrounding texts associated
with a medium in [59], [60], [78], [79]. The VIPS algorithm
makes full use of page layout structure. It first
extracts all the suitable blocks from the Document Object
Model (DOM) tree in HTML, and then finds the separators
between these blocks. Based on these separators, a Web
page can be represented by a semantic tree in which each
leaf node corresponds to a block. In this way, contents with
different topics are distinguished as separate blocks in a
Web page. Fig. 4(a) and (b) illustrate the vision-based
structure of a sample page. It is observed that this page
has two main blocks and the block named "VB-1-2-1-1" is
detected as the video block. Specifically, after obtaining all
the blocks via VIPS, the images or videos which are suitable
for advertising in the Web page are carefully
selected. Intuitively, the images or videos with poor visual
quality, or belonging to advertisements (usually
placed in certain positions of a page) or decorations
(usually too small), are first filtered out. Then, the
corresponding blocks with the remaining images or videos
are selected as the advertising page blocks. The surrounding
texts (e.g., title and description) are used to describe
each image or video.
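The filtering step at the end of this paragraph can be illustrated as follows. The sketch assumes that page segmentation has already produced a list of media blocks; it does not implement VIPS. The `MediaBlock` fields, the quality score, the thresholds, and most block labels are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MediaBlock:
    block_id: str         # label of the page block (naming is illustrative)
    width: int            # rendered width in pixels
    height: int           # rendered height in pixels
    quality: float        # assumed visual-quality score in [0, 1]
    is_ad: bool           # True if the medium is itself an advertisement
    surrounding_text: str

def advertising_blocks(blocks, min_side=120, min_quality=0.5):
    """Keep blocks whose medium is large enough, of adequate quality,
    and not already an advertisement. Thresholds are hypothetical."""
    return [
        b for b in blocks
        if not b.is_ad
        and min(b.width, b.height) >= min_side
        and b.quality >= min_quality
    ]

blocks = [
    MediaBlock("VB-1-1", 16, 16, 0.9, False, "site icon"),
    MediaBlock("VB-1-2-1-1", 640, 360, 0.8, False, "video title and description"),
    MediaBlock("VB-1-3", 300, 250, 0.9, True, "sponsored banner"),
    MediaBlock("VB-1-4", 400, 300, 0.2, False, "blurry photo"),
]
selected = [b.block_id for b in advertising_blocks(blocks)]
```

Only the video block survives: the icon is too small, the banner is already an advertisement, and the photo falls below the assumed quality threshold.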
Based on the surrounding texts, the expansion text can
be obtained by leveraging query expansion based on user
log [21], while the hidden texts are obtained by automatic
text categorization [116] and video concept detection [77],
[87]. Specifically, we use text categorization based on
Support V