www2007 initial summary

The www2007 conference is over. It was fun. There were some really good presentations. I summarized and linked to some below. I am sure there are many more good papers that I may have missed during the conference because of the way they were presented.
The Yahoo party was fun and I won a Squeezebox music player :). The banquet was fun and I thought the food was good.
Banff is amazing, a small town of 6700 people according to Wikipedia, which exists largely for the tourists. The main street is a long street with almost nothing but restaurants and gift shops. Lake Louise is really close by and everything is beautiful. Wildlife, snow, mountains, forests, all very beautiful.
One of the interesting things about the WWW conference is that it is so diverse: people from academia and industry all come here to look for good ideas. Furthermore, the Internet touches almost every field these days and the conference is just huge. A production of this scale is really difficult and all in all I think it was a great success.
So, I hope to be back for the next one, in Beijing, China in 2008.
Why We Search: Visualizing and Predicting User Behavior

This interesting paper by Eitan Adar et al. looks for correlations between topic event streams generated from blogs and news sites, and tries to use one stream to predict the shape of the others. They use dynamic time warping to map individual segments of the curves, such as peak, rise, fall and run.
They explore various ways of visualizing the topic behavior through time.
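To give a feel for dynamic time warping, here is a minimal sketch of the textbook DTW distance between two series. It is not the authors' implementation: the two streams are made-up numbers, and the function name dtw_distance is my own.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # extend the best of: stretch a, stretch b, or advance both
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Made-up topic volumes: the "news" peak is the "blog" peak one step later.
blog_stream = [0, 1, 3, 7, 3, 1, 0, 0]
news_stream = [0, 0, 1, 3, 7, 3, 1, 0]
flat_stream = [2] * 8

shifted = dtw_distance(blog_stream, news_stream)
flat = dtw_distance(blog_stream, flat_stream)
print(shifted, flat)
```

The shifted peak aligns perfectly once the time axis is allowed to stretch, while the flat stream stays far away. That is the property that makes DTW handy for comparing streams that react to the same event with different delays.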
Learning to Detect Phishing Emails

Very nice work by Ian Fette et al. The first thing they do is define "identifying phishing spam emails" as a different problem from identifying regular spam. Then they use a decision-tree-based classifier and a set of smart features to identify phishing attacks.
The features include: when the domains in the links were registered, links that use raw IP numbers, and a comparison of the domains of the links in the email to the domain of the "click here" type of links.
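Two of these features are easy to sketch with the Python standard library. This is only my illustration, not the paper's code: the keyword list HERE_WORDS, the feature names and the sample email are assumptions, the domain registration date is left out because it needs a WHOIS lookup, and the classifier itself is not shown.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
HERE_WORDS = ("click", "here", "login", "update")  # assumed keyword list


class LinkCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_features(html_body):
    parser = LinkCollector()
    parser.feed(html_body)
    parser.close()
    hosts = [(urlparse(href).hostname or "").lower() for href, _ in parser.links]
    named = [h for h in hosts if h]
    # the domain most links point to; "click here" links elsewhere are suspicious
    modal = Counter(named).most_common(1)[0][0] if named else ""
    here_off_modal = sum(
        1
        for (_, text), host in zip(parser.links, hosts)
        if host and host != modal and any(w in text.lower() for w in HERE_WORDS)
    )
    return {
        "num_links": len(parser.links),
        "ip_links": sum(1 for h in hosts if IP_HOST.match(h)),
        "here_links_off_modal_domain": here_off_modal,
    }


sample = (
    '<p>Dear customer, see <a href="http://www.example.com/help">our help page</a> '
    'and <a href="http://www.example.com/terms">terms</a>. '
    '<a href="http://192.0.2.7/signin">Click here</a> to verify your account.</p>'
)
features = link_features(sample)
print(features)
```

The resulting dictionary is the kind of feature vector you would hand to a classifier, together with many labeled emails.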
Predicting Clicks: Estimating the Click-Through Rate for New Ads

How do you determine ad ordering if you do not have an extensive click-through history? That is the question this paper addresses. They use machine learning, specifically logistic regression, to predict the click-through rate (CTR).
The basic model builds on previous work. The first thing they try is to add a notion of ad quality, landing page quality, and relevance. They further tried to improve the results by adding features, such as which key terms appear in the title and the text, and using machine learning to learn quality.
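Here is a minimal sketch of logistic regression used as a click predictor, trained with plain batch gradient descent. It is not the paper's model: the two binary features (query term in the ad title, query term in the ad text) and the handful of impressions are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_ctr(weights, bias, x):
    """Predicted click probability for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_logistic(rows, labels, lr=0.5, epochs=500):
    """Batch gradient descent on the logistic (cross-entropy) loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_ctr(weights, bias, x) - y
            for k, xi in enumerate(x):
                grad_w[k] += err * xi
            grad_b += err
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


# Made-up impressions: [term in title, term in text] -> clicked (1) or not (0)
rows = [[1, 1], [1, 1], [1, 1], [1, 0], [1, 0],
        [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
ctr_good = predict_ctr(weights, bias, [1, 1])  # a new ad matching on both
ctr_poor = predict_ctr(weights, bias, [0, 0])  # a new ad matching on neither
print(ctr_good, ctr_poor)
```

The point is that a brand-new ad with no click history still gets an estimate, because the estimate comes from its features rather than from its own past clicks.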
A New Suffix Tree Similarity Measure for Document Clustering

This paper, which is based on this interesting paper, talks about a new similarity measure. Looks cool, but I still need to read the details. Basically it combines the suffix tree document model with tf-idf.
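Since I have not read the details yet, the following is only a rough sketch of the general idea and not the paper's measure. Instead of building a real suffix tree, it simply enumerates word phrases up to an assumed maximum length (roughly what the nodes of a word-level suffix tree stand for), weights them with tf-idf, and compares documents with cosine similarity. The function names and the toy documents are mine.

```python
import math
from collections import Counter


def phrases(text, max_len=3):
    """All contiguous word sequences of 1..max_len words."""
    words = text.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def tfidf_vectors(docs, max_len=3):
    counts = [Counter(phrases(d, max_len)) for d in docs]
    # document frequency: in how many documents each phrase occurs
    df = Counter(p for c in counts for p in c)
    n_docs = len(docs)
    return [
        {p: tf * math.log(n_docs / df[p]) for p, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


docs = ["the cat ate the cheese", "the cat ate the mouse", "dogs chase cars"]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(sim_01, sim_02)
```

Because whole phrases are compared and not just single words, two documents that share word order score higher than two that merely share vocabulary.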
Finished www2007 presentation

Just finished the www2007 presentation of "Do Not Crawl in the DUST: Different URLs with Similar Text". As far as I can tell it went pretty well. If you missed the presentation you'll have to read the paper instead.
Page-level Template Detection via Isotonic Smoothing

Cute work about template detection, short summary follows.

Previous work is site-based and works in two phases. The limitations of this technique are that pages may not be processed in site order, new sites may be a problem, and processing may be inefficient.
Essentially, they:
  1. obtain site-specific training data
  2. learn site-specific templates
  3. try to learn a global detector for templateness.
Features they use include: placement on the screen, background color, series of links that are likely to be part of the template, and average sentence size. Then they use a classifier to differentiate between the template parts of a page and the content.
In the results they show that shingling after template detection works better than shingling without template detection.
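To see why stripping the template helps shingling, here is a minimal sketch with word 3-shingles and Jaccard similarity. The template detector itself is not implemented: I strip the navigation text by hand, and the two pages are made-up examples of the same article served under two different site templates.

```python
def shingles(text, w=3):
    """Set of all runs of w consecutive words (w-shingles)."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(len(words) - w + 1, 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up pages: identical article, different site navigation.
nav_a = "home news sports weather login"
nav_b = "start archive contact imprint search"
article = "banff is a small town in the canadian rockies"

page_a = nav_a + " " + article
page_b = nav_b + " " + article

with_template = jaccard(shingles(page_a), shingles(page_b))
without_template = jaccard(shingles(article), shingles(article))
print(with_template, without_template)
```

With the templates left in, the two copies of the article look only partly similar. With the templates removed, they are recognized as identical.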
Web Projections: Learning from Contextual Subgraphs of the Web

The general idea of "Web Projections: Learning from Contextual Subgraphs of the Web" is to extract a sub-graph according to some context, for example a query, and then to use that sub-graph and machine learning to predict things such as the quality of the pages and user behavior. Cool.
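A minimal sketch of the first half of that idea: take the pages returned for a query, keep only the links among them, and compute a few graph features that a learner could use. The tiny link graph, the result set and the particular features are my own assumptions, not the ones from the paper.

```python
def project(edges, result_nodes):
    """Keep only the links whose two endpoints are both in the result set."""
    keep = set(result_nodes)
    return [(u, v) for u, v in edges if u in keep and v in keep]


def graph_features(result_nodes, sub_edges):
    nodes = set(result_nodes)
    n = len(nodes)
    touched = {u for u, _ in sub_edges} | {v for _, v in sub_edges}
    possible = n * (n - 1)  # number of possible directed links
    return {
        "nodes": n,
        "edges": len(sub_edges),
        "density": len(sub_edges) / possible if possible else 0.0,
        "isolated": len(nodes - touched),
    }


# Made-up link graph and a made-up result set for one query.
web_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
results = ["a", "b", "c", "f"]

sub_edges = project(web_edges, results)
features = graph_features(results, sub_edges)
print(features)
```

Feature dictionaries like this one, computed per query, are the kind of input you would feed to a classifier that predicts result quality or user behavior.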
Efficient Search Engine Measurements

If you happened to miss the www2007 talk "Efficient Search Engine Measurements" by Ziv Bar-Yossef and Maxim Gurevich, you should go and read the paper.
The paper describes an efficient and accurate method of estimating various properties of a search engine, such as the size of its document collection. It does so through the standard query interface. I will not do it justice if I try to describe the details, so go and read it.
Navigation-Aided Retrieval by Pandit and Olston

The basic idea of this work is to assume the user of the search engine is willing to do some navigation to find what he is looking for.
The question then becomes not which document is the most relevant, but where we should "drop off" the user so that he is most likely to find what he is looking for. Cool.
Further, they highlight the paths that could lead the user to interesting pages.
For those not in WWW2007

If you are not in www2007, and you still want to see a cool lecture, go here and look for PRABHAKAR RAGHAVAN. This excellent lecture covers both Yahoo Answers and advertisement auctions. Any optimization to advertisement auctions means big money, and that is why you should care.
WWW2007 worth a read

While attending the Query Log Analysis session of the WWW2007 conference, I saw this work, which seems good. The presentation talks about a better model of search engine users and the way they click. For example, the user model takes into account whether the user considered a result, and how attractive that result was.
I am in WWW2007

That is it. I am here in Banff, Canada, at the WWW 2007 conference. I will be presenting my paper Do Not Crawl in the DUST about identifying different URLs with similar text. I am excited to see my Israeli colleagues, Maxim Gurevich and Ziv Bar-Yossef, who are also presenting a paper about efficient search engine measurement.
I will try and update the web site with anything I find interesting.