Page-level Template Detection via Isotonic Smoothing
posted by shuri on 2007-05-10 10:00:26
tags: www2007,mynotebook,news
Cute work about template detection; a short summary follows.

Previous work is site-based and two-phase. The limitations of this technique: pages may not be processed in site order, new sites may be a problem, and processing may be inefficient.
Essentially, they:
  1. obtain site-specific training data
  2. learn site-specific templates
  3. try to learn a global detector for templateness.
Features they use include: placement on the screen, background color, series of links that are likely to be part of the template, and average sentence size. They then use a classifier to differentiate between the template parts of a page and the content.
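To make the feature idea concrete, here is a rough sketch of what per-block features might look like. Everything in it is my own assumption, not the paper's code: the `Block` record, the exact feature definitions, and especially the hand-picked `WEIGHTS` and `BIAS`, which stand in for a classifier that would really be learned from training data.

```python
import math
import re
from collections import namedtuple

# Hypothetical block record: visible text, number of links,
# vertical position (0.0 = top of page, 1.0 = bottom), background color.
Block = namedtuple("Block", "text num_links y_fraction background")

def features(block, page_background="#ffffff"):
    """Turn one page block into a small numeric feature vector."""
    sentences = [s for s in re.split(r"[.!?]+", block.text) if s.strip()]
    words = block.text.split()
    avg_sentence_len = len(words) / len(sentences) if sentences else 0.0
    link_density = block.num_links / len(words) if words else 0.0
    near_edge = 1.0 if block.y_fraction < 0.1 or block.y_fraction > 0.9 else 0.0
    odd_background = 1.0 if block.background.lower() != page_background.lower() else 0.0
    return [avg_sentence_len, link_density, near_edge, odd_background]

# Illustrative hand-set weights (NOT learned, NOT from the paper):
# long sentences argue for content; links, page edges and
# unusual backgrounds argue for template.
WEIGHTS = [-0.3, 4.0, 1.5, 1.0]
BIAS = -1.0

def templateness(block):
    """Logistic score in (0, 1); higher means more template-like."""
    z = BIAS + sum(w * x for w, x in zip(WEIGHTS, features(block)))
    return 1.0 / (1.0 + math.exp(-z))

nav = Block("Home Products About Contact", 4, 0.02, "#eeeeee")
article = Block("The match ended late last night. Fans stayed until the final whistle.",
                0, 0.5, "#ffffff")
print(templateness(nav), templateness(article))
```

With these made-up weights the navigation bar scores well above 0.5 and the article paragraph well below it, which is the kind of separation a trained classifier would be expected to produce.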
In the results they show that shingling after template detection works better than shingling without template detection.
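A toy illustration of why that result is plausible, assuming word-level 3-shingles and Jaccard resemblance (my choice of parameters, not necessarily the paper's setup). Pages are hypothetical lists of `(text, is_template)` blocks, with the template labels assumed to come from some detector.

```python
def shingles(text, k=3):
    """Set of k-word shingles of a text (lowercased, whitespace-tokenized)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard resemblance of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def resemblance(page_a, page_b, drop_template=False):
    """Compare two pages, optionally ignoring blocks flagged as template."""
    def text_of(page):
        return " ".join(text for text, is_template in page
                        if not (drop_template and is_template))
    return jaccard(shingles(text_of(page_a)), shingles(text_of(page_b)))

# Two made-up pages sharing a navigation template but with different content.
TEMPLATE = "home products about us contact privacy policy terms of use"
page_a = [(TEMPLATE, True),
          ("isotonic regression fits a monotone function to noisy scores", False)]
page_b = [(TEMPLATE, True),
          ("shingling compares documents by their overlapping word sequences", False)]

raw = resemblance(page_a, page_b)
clean = resemblance(page_a, page_b, drop_template=True)
print(raw, clean)
```

With the template left in, the two unrelated pages share all their template shingles and look partially similar; with template blocks dropped, the resemblance falls to zero, so the shared boilerplate no longer inflates the duplicate score.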