   1
   2 #Time-stamp: "2001-03-10 23:19:11 MST" -*-Text-*-
   3 # This document contains text in Perl "POD" format.
   4 # Use a POD viewer like perldoc or perlman to render it.
   5
   6 =head1 NAME
   7
   8 HTML::Tree::Scanning -- article: "Scanning HTML"
   9
  10 =head1 SYNOPSIS
  11
  12   # This is an article, not a module.
  13
  14 =head1 DESCRIPTION
  15
  16 The following article by Sean M. Burke first appeared in I<The Perl
  17 Journal> #19 and is copyright 2000 The Perl Journal. It appears
  18 courtesy of Jon Orwant and The Perl Journal.  This document may be
  19 distributed under the same terms as Perl itself.
  20
  21 =head1 Scanning HTML
  22
  23 -- Sean M. Burke
  24
  25 In I<The Perl Journal> issue 17, Ken MacFarlane's article "Parsing
  26 HTML with HTML::Parser" describes how the HTML::Parser module scans
  27 HTML source as a stream of start-tags, end-tags, text, comments, etc.
  28 In TPJ #18, my "Trees" article kicked around the idea of tree-shaped
  29 data structures.  Now I'll try to tie it together, in a discussion of
  30 HTML trees.
  31
  32 The CPAN module HTML::TreeBuilder takes the
  33 tags that HTML::Parser picks out, and builds a parse tree -- a
  34 tree-shaped network of objects...
  35
  36 =over
  37
  38 Footnote:
  39 And if you need a quick explanation of objects, see my TPJ17 article "A
  40 User's View of Object-Oriented Modules"; or go whole hog and get Damian
  41 Conway's excellent book I<Object-Oriented Perl>, from Manning
  42 Publications.
  43
  44 =back
  45
  46 ...representing the structured content of the HTML document.  And once
  47 the document is parsed as a tree, you'll find the common tasks
  48 of extracting data from that HTML document/tree to be quite
  49 straightforward.
  50
  51 =head2 HTML::Parser, HTML::TreeBuilder, and HTML::Element
  52
  53 You use HTML::TreeBuilder to make a parse tree out of an HTML source
  54 file, by simply saying:
  55
  56   use HTML::TreeBuilder;
  57   my $tree = HTML::TreeBuilder->new();
  58   $tree->parse_file('foo.html');
  59
  60 and then C<$tree> contains a parse tree built from the HTML source from
  61 the file "foo.html".  The way this parse tree is represented is with a
  62 network of objects -- C<$tree> is the root, an element with tag-name
  63 "html", and its children typically include a "head" and "body" element,
  64 and so on.  Elements in the tree are objects of the class
  65 HTML::Element.
  66
  67 So, if you take this source:
  68
  69   <html><head><title>Doc 1</title></head>
  70   <body>
  71   Stuff <hr> 2000-08-17
  72   </body></html>
  73
  74 and feed it to HTML::TreeBuilder, it'll return a tree of objects that
  75 looks like this:
  76
  77                html
  78              /      \
  79          head        body
  80         /          /   |  \
  81      title    "Stuff"  hr  "2000-08-17"
  82        |
  83     "Doc 1"
  84
  85 This is a pretty simple document, but if it were any more complex,
  86 it'd be a bit hard to draw in that style, since it'd sprawl left and
  87 right.  The same tree can be represented a bit more easily sideways,
  88 with indenting:
  89
  90   . html
  91      . head
  92         . title
  93            . "Doc 1"
  94      . body
  95         . "Stuff"
  96         . hr
  97         . "2000-08-17"
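
For a quick look at a tree like this one, the C<dump> method of HTML::Element prints an indented outline of it. What follows is a minimal sketch, assuming the sample document is held in a string (the variable name C<$html> is made up for the illustration) rather than read from a file:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The sample document from above, held in a string for illustration.
my $html = <<'END_HTML';
<html><head><title>Doc 1</title></head>
<body>
Stuff <hr> 2000-08-17
</body></html>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);   # feed the source to the parser
$tree->eof;            # signal end of input so the tree is finalized

$tree->dump;           # print an indented outline of the tree to STDOUT

# The root's tag name, and the tag names of its children.
my $root_tag = $tree->tag;
my @top      = map { ref($_) ? $_->tag : $_ } $tree->content_list;

$tree->delete;         # release the tree when done
```

The exact layout of the C<dump> output may differ from the hand-drawn outline above, but it should show the same nesting.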
  98
  99 Either way expresses the same structure.  In that structure, the root
 100 node is an object of the class HTML::Element
 101
 102 =over
 103
 104 Footnote:
 105 Well actually, the root is of the class HTML::TreeBuilder, but that's
 106 just a subclass of HTML::Element, plus the few extra methods like
 107 C<parse_file> that elaborate the tree
 108
 109 =back
 110
 111 , with the tag name "html", and with two children: HTML::Element
 112 objects whose tag names are "head" and "body".  And each of those
 113 elements has children, and so on down.  Not all elements (as we'll
 114 call the objects of class HTML::Element) have children -- the "hr"
 115 element doesn't.  And not all nodes in the tree are elements -- the
 116 text nodes ("Doc 1", "Stuff", and "2000-08-17") are just strings.
 117
 118 Objects of the class HTML::Element each have three noteworthy attributes:
 119
 120 =over
 121
 122 =item C<_tag> -- (best accessed as C<$e-E<gt>tag>)
 123 this element's tag-name, lowercased (e.g., "em" for an "em" element).
 124
 125 =over
 126
 127 Footnote: Yes, this is misnamed.  In proper SGML terminology, this is
 128 instead called a "GI", short for "generic identifier"; and the term
 129 "tag" is used for a token of SGML source that represents either
 130 the start of an element (a start-tag like "<em lang='fr'>") or the end
 131 of an element (an end-tag like "</em>").  However, since more people
 132 claim to have been abducted by aliens than to have ever seen the
 133 SGML standard, and since both encounters typically involve a feeling of
 134 "missing time", it's not surprising that the terminology of the SGML
 135 standard is not closely followed.
 136
 137 =back
 138
 139 =item C<_parent> -- (best accessed as C<$e-E<gt>parent>)
 140 the element that is C<$e>'s parent, or undef if this element is the
 141 root of its tree.
 142
 143 =item C<_content> -- (best accessed as C<$e-E<gt>content_list>)
 144 the list of nodes (i.e., elements or text segments) that are C<$e>'s
 145 children.
 146
 147 =back
 148
 149 Moreover, if an element object has any attributes in the SGML sense of
 150 the word, then those are readable as C<$e-E<gt>attr('name')> -- for
 151 example, with the object built from having parsed "E<lt>a
 152 B<id='foo'>E<gt>barE<lt>/aE<gt>", C<$e-E<gt>attr('id')> will return
 153 the string "foo".  Moreover, C<$e-E<gt>tag> on that object returns the
 154 string "a", C<$e-E<gt>content_list> returns a list consisting of just
 155 the single scalar "bar", and C<$e-E<gt>parent> returns the object
 156 that's this node's parent -- which may be, for example, a "p" element.
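
The accessors just described can be tried out on that same snippet. This is only a sketch: the surrounding "p" element is an assumption added so that the "a" element has an interesting parent, and the C<look_down> call used to find the element is explained later in this article.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse(q{<p><a id='foo'>bar</a></p>});
$tree->eof;

# Find the "a" element (look_down is covered later in the article).
my $e = $tree->look_down('_tag', 'a');

my $tag      = $e->tag;            # the tag name, "a"
my $id       = $e->attr('id');     # the SGML-sense attribute, "foo"
my @children = $e->content_list;   # a single plain string, "bar"
my $parent   = $e->parent->tag;    # tag name of the parent, "p"

print "tag=$tag id=$id children=@children parent=$parent\n";

$tree->delete;
```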
 157
 158 And that's all that there is to it -- you throw HTML
 159 source at TreeBuilder, and it returns a tree built of HTML::Element
 160 objects and some text strings.
 161
 162 However, what do you I<do> with a tree of objects?  People code
 163 information into HTML trees not for the fun of arranging elements, but
 164 to represent the structure of specific text and images -- some text is
 165 in this "li" element, some other text is in that heading, some
 166 images are in that other table cell that has those attributes, and so on.
 167
 168 Now, it may happen that you're rendering that whole HTML tree into some
 169 layout format.  Or you could be trying to make some systematic change to
 170 the HTML tree before dumping it out as HTML source again.  But, in my
 171 experience, by far the most common programming task that Perl
 172 programmers face with HTML is in trying to extract some piece
 173 of information from a larger document.  Since that's so common (and
 174 also since it involves concepts that are basic to more complex tasks),
 175 that is what the rest of this article will be about.
 176
 177 =head2 Scanning HTML trees
 178
 179 Suppose you have a thousand HTML documents, each of them a press
 180 release.  They all start out:
 181
 182   [...lots of leading images and junk...]
 183   <h1>ConGlomCo to Open New Corporate Office in Ouagadougou</h1>
 184   BAKERSFIELD, CA, 2000-04-24 -- ConGlomCo's vice president in charge
 185   of world conquest, Rock Feldspar, announced today the opening of a
 186   new office in Ouagadougou, the capital city of Burkina Faso, gateway
 187   to the bustling "Silicon Sahara" of Africa...
 188   [...etc...]
 189
 190 ...and what you've got to do is, for each document, copy whatever text
 191 is in the "h1" element, so that you can, for example, make a table of
 192 contents of it.  Now, there are three ways to do this:
 193
 194 =over
 195
 196 =item * You can just use a regexp to scan the file for a text pattern.
 197
 198 For many very simple tasks, this will do fine.  Many HTML documents are,
 199 in practice, very consistently formatted as far as placement of
 200 linebreaks and whitespace, so you could just get away with scanning the
 201 file like so:
 202
 203   sub get_heading {
 204     my $filename = $_[0];
 205     local *HTML;
 206     open(HTML, $filename)
 207       or die "Couldn't open $filename: $!";
 208     my $heading;
 209    Line:
 210     while(<HTML>) {
 211       if( m{<h1>(.*?)</h1>}i ) {  # match it!
 212         $heading = $1;
 213         last Line;
 214       }
 215     }
 216     close(HTML);
 217     warn "No heading in $filename?"
 218      unless defined $heading;
 219     return $heading;
 220   }
 221
 222 This is quick and simple, but awfully fragile -- if there's a newline in
 223 the middle of a heading's text, it won't match the above regexp, and
 224 you'll get only the "No heading" warning.  The regexp will also fail if the "h1" element's
 225 start-tag has any attributes.  If you have to adapt your code to fit
 226 more kinds of start-tags, you'll end up basically reinventing part of
 227 HTML::Parser, at which point you should probably just stop, and use
 228 HTML::Parser itself:
 229
 230 =item * You can use HTML::Parser to scan the file for an "h1" start-tag
 231 token, then capture all the text tokens until the "h1" end-tag.  This
 232 approach is extensively covered in Ken MacFarlane's TPJ17 article
 233 "Parsing HTML with HTML::Parser".  (A variant of this approach is to use
 234 HTML::TokeParser, which presents a different and rather handier
 235 interface to the tokens that HTML::Parser picks out.)
 236
 237 Using HTML::Parser is less fragile than our first approach, since it's
 238 not sensitive to the exact internal formatting of the start-tag (much
 239 less whether it's split across two lines).  However, when you need more
 240 information about the context of the "h1" element, or if you're having
 241 to deal with any of the tricky bits of HTML, such as parsing of tables,
 242 you'll find that the flat list of tokens that HTML::Parser returns
 243 isn't immediately useful.  To get something useful out of those tokens,
 244 you'll need to write code that knows some things about what elements
 245 take no content (as with "hr" elements), and that "</p>" end-tags
 246 are omissible, so a "<p>" will end any currently
 247 open paragraph -- and you're well on your way to pointlessly
 248 reinventing much of the code in HTML::TreeBuilder
 249
 250 =over
 251
 252 Footnote:
 253 And, as the person who last rewrote that module, I can attest that it
 254 wasn't terribly easy to get right!  Never underestimate the perversity
 255 of people coding HTML.
 256
 257 =back
 258
 259 , at which point you should probably just stop, and use
 260 HTML::TreeBuilder itself:
 261
 262 =item * You can use HTML::TreeBuilder, and scan the tree of element
 263 objects that you get back.
 264
 265 =back
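
The token-based second approach can be sketched with HTML::TokeParser. The sample input and the sub name C<get_heading_from_tokens> are made up for this illustration; the point is that an attribute in the start-tag, or a newline in the heading text, no longer gets in the way:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical sample input: the start-tag has an attribute and the
# heading text is split across two lines.
my $html = <<'END_HTML';
<p>leading junk</p>
<h1 class="headline">ConGlomCo to Open
New Corporate Office</h1>
<p>BAKERSFIELD, CA ...</p>
END_HTML

sub get_heading_from_tokens {
  my $source = $_[0];
  # A reference to a scalar means "parse this string", not "open this file".
  my $stream = HTML::TokeParser->new(\$source)
    or die "Couldn't set up the token stream";
  # Skip ahead to the first "h1" start-tag, whatever its attributes.
  return unless $stream->get_tag('h1');
  # Collect the text up to the "h1" end-tag, with whitespace tidied.
  return $stream->get_trimmed_text('/h1');
}

my $heading = get_heading_from_tokens($html);
print defined $heading ? "$heading\n" : "No heading found\n";
```

This handles the simple case, but it still knows nothing about which elements contain which, and that is the limitation described in the second item above.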
 266
 267 The last approach, using HTML::TreeBuilder, is the diametric opposite of
 268 the first approach:  The first approach involves just elementary Perl and one
 269 regexp, whereas the TreeBuilder approach involves being at home with
 270 the concept of tree-shaped data structures and modules with
 271 object-oriented interfaces, as well as with the particular interfaces
 272 that HTML::TreeBuilder and HTML::Element provide.
 273
 274 However, what the TreeBuilder approach has going for it is that it's
 275 the most robust, because it involves dealing with HTML in its "native"
 276 format -- it deals with the tree structure that HTML code represents,
 277 without any consideration of how the source is coded and with what
 278 tags omitted.
 279
 280 So, to extract the text from the "h1" elements of an HTML document:
 281
 282   sub get_heading {
 283     my $tree = HTML::TreeBuilder->new;
 284     $tree->parse_file($_[0]);   # !
 285     my $heading;
 286     my $h1 = $tree->look_down('_tag', 'h1');  # !
 287     if($h1) {
 288       $heading = $h1->as_text;   # !
 289     } else {
 290       warn "No heading in $_[0]?";
 291     }
 292     $tree->delete; # clear memory!
 293     return $heading;
 294   }
 295
 296 This uses some unfamiliar methods that need explaining.  The
 297 C<parse_file> method, which we've seen before, builds a tree based on
 298 source from the file given.  The C<delete> method is for marking a
 299 tree's contents as available for garbage collection, when you're done
 300 with the tree.  The C<as_text> method returns a string that contains
 301 all the text bits that are children (or otherwise descendants) of the
 302 given node -- to get the text content of the C<$h1> object, we could
 303 just say:
 304
 305   $heading = join '', $h1->content_list;
 306
 307 but that will work only if we're sure that the "h1" element's children
 308 will be only text bits -- if the document contained:
 309
 310   <h1>Local Man Sees <cite>Blade</cite> Again</h1>
 311
 312 then the sub-tree would be:
 313
 314   . h1
 315     . "Local Man Sees "
 316     . cite
 317       . "Blade"
 318     . " Again"
 319
 320 so C<join '', $h1-E<gt>content_list> will be something like:
 321
 322   Local Man Sees HTML::Element=HASH(0x15424040) Again
 323
 324 whereas C<$h1-E<gt>as_text> would yield:
 325
 326   Local Man Sees Blade Again
 327
 328 and depending on what you're doing with the heading text, you might
 329 want the C<as_HTML> method instead.  It returns the (sub)tree
 330 represented as HTML source.  C<$h1-E<gt>as_HTML> would yield:
 331
 332   <h1>Local Man Sees <cite>Blade</cite> Again</h1>
 333
 334 However, if you wanted the contents of C<$h1> as HTML, but not the
 335 C<$h1> itself, you could say:
 336
 337   join '',
 338     map(
 339       ref($_) ? $_->as_HTML : $_,
 340       $h1->content_list
 341     )
 342
 343 This C<map> iterates over the nodes in C<$h1>'s list of children; and
 344 for each node that's just a text bit (as "Local Man Sees " is), it just
 345 passes through that string value, and for each node that's an actual
 346 object (causing C<ref> to be true), C<as_HTML> will be used instead of the
 347 string value of the object itself (which would be something quite
 348 useless, as most object values are).  So that C<as_HTML> for the "cite"
 349 element will be the string "<cite>BladeE<lt>/cite>".  And then,
 350 finally, C<join> just puts into one string all the strings that the
 351 C<map> returns.
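
The three ways of getting at the heading's content can be compared side by side. This is a small sketch that parses the sample heading from a string; the variable names are made up for the illustration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;
$tree->parse('<h1>Local Man Sees <cite>Blade</cite> Again</h1>');
$tree->eof;
my $h1 = $tree->look_down('_tag', 'h1');

# Text of all descendants, with no markup.
my $text = $h1->as_text;

# The element itself, start-tag and end-tag included, as HTML source.
my $outer = $h1->as_HTML;

# Only the children: strings pass through, elements are serialized.
my $inner = join '',
  map( ref($_) ? $_->as_HTML : $_, $h1->content_list );

print "as_text:  $text\n";
print "as_HTML:  $outer\n";
print "children: $inner\n";

$tree->delete;
```

One caveat about the last form: the text children are passed through as plain strings, so a character such as "&" in them is not turned back into an entity the way C<as_HTML> would do it.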
 352
 353 Last but not least, the most important method in our C<get_heading> sub
 354 is the C<look_down> method.  This method looks down at the subtree
 355 starting at the given object (C<$tree>), looking for elements that meet
 356 criteria you provide.
 357
 358 The criteria are specified in the method's argument list.  Each
 359 criterion can consist of two scalars, a key and a value, which express
 360 that you want elements that have that attribute (like "_tag", or
 361 "src") with the given value ("h1"); or the criterion can be a
 362 reference to a subroutine that, when called on the given element,
 363 returns true if that is a node you're looking for.  If you specify
 364 several criteria, then that's taken to mean that you want all the
 365 elements that each satisfy I<all> the criteria.  (In other words,
 366 there's an "implicit AND".)
 367
 368 And finally, there's a bit of an optimization -- if you call the
 369 C<look_down> method in a scalar context, you get just the I<first> node
 370 (or undef if none) -- and, in fact, once C<look_down> finds that first
 371 matching element, it doesn't bother looking any further.
 372
 373 So the example:
 374
 375   $h1 = $tree->look_down('_tag', 'h1');
 376
 377 returns the first element at-or-under C<$tree> whose C<"_tag">
 378 attribute has the value C<"h1">.
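
The "implicit AND" and the difference between scalar and list context can be seen in a small sketch. The markup here is hypothetical, made up only to give the criteria something to choose between:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup, made up for this illustration.
my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<p><a href="/a" class="nav">Home</a>
<a href="/b" class="story">First story</a>
<a href="/c" class="story">Second story</a>
<a name="anchor" class="story">Not a link</a></p>
END_HTML
$tree->eof;

# Two key/value criteria plus a sub criterion; all of them must hold.
my @criteria = (
  '_tag',  'a',
  'class', 'story',
  sub { defined $_[0]->attr('href') },
);

my $first = $tree->look_down(@criteria);   # scalar context: first match only
my @all   = $tree->look_down(@criteria);   # list context: every match

# Copy out the text before the tree is deleted.
my $first_text = $first->as_text;
my @all_text   = map { $_->as_text } @all;

print "first: $first_text\n";
print "all:   ", join(' | ', @all_text), "\n";

$tree->delete;
```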
 379
 380 =head2 Complex Criteria in Tree Scanning
 381
 382 Now, the above C<look_down> code looks like a lot of bother, with
 383 barely more benefit than just grepping the file!  But consider if your
 384 criteria were more complicated -- suppose you found that some of the
 385 press releases that you were scanning had several "h1" elements,
 386 possibly before or after the one you actually want.  For example:
 387
 388   <h1><center>Visit Our Corporate Partner
 389    <br><a href="/dyna/clickthru"
 390      ><img src="/dyna/vend_ad"></a>
 391   </center></h1>
 392   <h1><center>ConGlomCo President Schreck to Visit Regional HQ
 393    <br><a href="/photos/Schreck_visit_large.jpg"
 394      ><img src="/photos/Schreck_visit.jpg"></a>
 395   </center></h1>
 396
 397 Here, you want to ignore the first "h1" element because it contains an
 398 ad, and you want the text from the second "h1".  The problem is in
 399 formalizing the way you know that it's an ad.  Since ad banners are
 400 always entreating you to "visit" the sponsoring site, you could exclude
 401 "h1" elements that contain the word "visit" under them:
 402
 403   my $real_h1 = $tree->look_down(
 404     '_tag', 'h1',
 405     sub {
 406       $_[0]->as_text !~ m/\bvisit/i
 407     }
 408   );
 409
 410 The first criterion looks for "h1" elements, and the second criterion
 411 limits those to only the ones whose text content doesn't match
 412 C<m/\bvisit/>.  But unfortunately, that won't work for our example,
 413 since the second "h1" mentions "ConGlomCo President Schreck to
 414 I<Visit> Regional HQ".
 415
 416 Instead you could try looking for the first "h1" element that
 417 doesn't contain an image:
 418
 419   my $real_h1 = $tree->look_down(
 420     '_tag', 'h1',
 421     sub {
 422       not $_[0]->look_down('_tag', 'img')
 423     }
 424   );
 425
 426 This criterion sub might seem a bit odd, since it calls C<look_down>
 427 as part of a larger C<look_down> operation, but that's fine.  Note that
 428 when considered as a boolean value, a C<look_down> in a scalar context
 429 returns false (specifically, undef) if there's no matching element
 430 at or under the given element; and it returns the first matching
 431 element (which, being a reference and object, is always a true value),
 432 if any matches.  So, here,
 433
 434   sub {
 435     not $_[0]->look_down('_tag', 'img')
 436   }
 437
 438 means "return true only if this element has no 'img' element as
 439 descendants (and isn't an 'img' element itself)."
 440
 441 This correctly filters out the first "h1" that contains the ad, but it
 442 also incorrectly filters out the second "h1" that contains a
 443 non-advertisement photo besides the headline text you want.
 444
 445 There clearly are detectable differences between the first and second
 446 "h1" elements -- only the second one contains the string "Schreck", and
 447 we could just test for that:
 448
 449   my $real_h1 = $tree->look_down(
 450     '_tag', 'h1',
 451     sub {
 452       $_[0]->as_text =~ m{Schreck}
 453     }
 454   );
 455
 456 And that works fine for this one example, but unless all thousand of
 457 your press releases have "Schreck" in the headline, that's just not a
 458 general solution.  However, if all the ads-in-"h1"s that you want to
 459 exclude involve a link whose URL involves "/dyna/", then you can use
 460 that:
 461
 462   my $real_h1 = $tree->look_down(
 463     '_tag', 'h1',
 464     sub {
 465       my $link = $_[0]->look_down('_tag','a');
 466       return 1 unless $link;
 467         # no link means it's fine
 468       return 0 if $link->attr('href') =~ m{/dyna/};
 469         # a link to there is bad
 470       return 1; # otherwise okay
 471     }
 472   );
 473
 474 Or you can look at it another way and say that you want the first "h1"
 475 element that either contains no images, or else whose image has a "src"
 476 attribute whose value contains "/photos/":
 477
 478   my $real_h1 = $tree->look_down(
 479     '_tag', 'h1',
 480     sub {
 481       my $img = $_[0]->look_down('_tag','img');
 482       return 1 unless $img;
 483         # no image means it's fine
 484       return 1 if $img->attr('src') =~ m{/photos/};
 485         # good if a photo
 486       return 0; # otherwise bad
 487     }
 488   );
 489
 490 Recall that this use of C<look_down> in a scalar context means to return
 491 the first element at or under C<$tree> that matches all the criteria.
 492 But if you notice that you can formulate criteria that'll match several
 493 possible "h1" elements, some of which may be bogus but the I<last> one
 494 of which is always the one you want, then you can use C<look_down> in a
 495 list context, and just use the last element of that list:
 496
 497   my @h1s = $tree->look_down(
 498     '_tag', 'h1',
 499     ...maybe more criteria...
 500   );
 501   die "What, no h1s here?" unless @h1s;
 502   my $real_h1 = $h1s[-1]; # last or only
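
To see one of these filters at work, here is a sketch that runs the "/dyna/" criterion against the two sample headings. It makes two assumptions: the "center" wrappers are left out to keep the markup simple, and the criterion sub guards against an "a" element that has no "href" attribute at all.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# The two headings from the example, minus the "center" wrappers.
my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<h1>Visit Our Corporate Partner
 <br><a href="/dyna/clickthru"
   ><img src="/dyna/vend_ad"></a>
</h1>
<h1>ConGlomCo President Schreck to Visit Regional HQ
 <br><a href="/photos/Schreck_visit_large.jpg"
   ><img src="/photos/Schreck_visit.jpg"></a>
</h1>
END_HTML
$tree->eof;

# Criterion sub: reject any "h1" whose first link points into /dyna/.
my $not_an_ad = sub {
  my $link = $_[0]->look_down('_tag', 'a');
  return 1 unless $link;                      # no link at all: keep it
  my $href = $link->attr('href');
  return 0 if defined $href and $href =~ m{/dyna/};
  return 1;                                   # otherwise okay
};

my @all_h1s = $tree->look_down('_tag', 'h1');              # list context
my $real_h1 = $tree->look_down('_tag', 'h1', $not_an_ad);  # scalar context

my $count   = scalar @all_h1s;
my $heading = $real_h1 ? $real_h1->as_text : '(none)';
print "$count h1 elements; chosen heading: $heading\n";

$tree->delete;
```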
 503
 504 =head2 A Case Study: Scanning Yahoo News's HTML
 505
 506 The above (somewhat contrived) case involves extracting data from a
 507 bunch of pre-existing HTML files.  In that sort of situation, if your
 508 code works for all the files, then you know that the code I<works> --
 509 since the data it's meant to handle won't go changing or growing; and,
 510 typically, once you've used the program, you'll never need to use it
 511 again.
 512
 513 The other kind of situation faced in many data extraction tasks is
 514 where the program is used recurringly to handle new data -- such as
 515 from ever-changing Web pages.  As a real-world example of this,
 516 consider a program that you could use (suppose it's crontabbed) to
 517 extract headline-links from subsections of Yahoo News
 518 (C<http://dailynews.yahoo.com/>).
 519
 520 Yahoo News has several subsections:
 521
 522 =over
 523
 524 =item http://dailynews.yahoo.com/h/tc/ for technology news
 525
 526 =item http://dailynews.yahoo.com/h/sc/ for science news
 527
 528 =item http://dailynews.yahoo.com/h/hl/ for health news
 529
 530 =item http://dailynews.yahoo.com/h/wl/ for world news
 531
 532 =item http://dailynews.yahoo.com/h/en/ for entertainment news
 533
 534 =back
 535
 536 and others.  All of them are built on the same basic HTML template --
 537 and a scarily complicated template it is, especially when you look at
 538 it with an eye toward making up rules that will select where the real
 539 headline-links are, while screening out all the links to other parts of
 540 Yahoo, other news services, etc.  You will need to puzzle
 541 over the HTML source, and scrutinize the output of
 542 C<$tree-E<gt>dump> on the parse tree of that HTML.
 543
 544 Sometimes the only way to pin down what you're after is by position in
 545 the tree. For example, headlines of interest may be in the third
 546 column of the second row of the second table element in a page:
 547
 548   my $table = ( $tree->look_down('_tag','table') )[1];
 549   my $row2  = ( $table->look_down('_tag', 'tr' ) )[1];
 550   my $col3  = ( $row2->look_down('_tag', 'td')   )[2];
 551   ...then do things with $col3...
 552
 553 Or they may be all the links in a "p" element that has at least three
 554 "br" elements as children:
 555
 556   my $p = $tree->look_down(
 557     '_tag', 'p',
 558     sub {
 559       2 < grep { ref($_) and $_->tag eq 'br' }
 560                $_[0]->content_list
 561     }
 562   );
 563   @links = $p->look_down('_tag', 'a');
 564
 565 But almost always, you can get away with looking for properties
 566 of the thing itself, rather than just looking for contexts.  Now, if
 567 you're lucky, the document you're looking through has clear semantic
 568 tagging, such as is useful in CSS -- note the
 569 class="headlinelink" bit here:
 570
 571   <a href="...long_news_url..." class="headlinelink">Elvis
 572   seen in tortilla</a>
 573
 574 If you find anything like that, you could leap right in and select
 575 links with:
 576
 577   @links = $tree->look_down('class','headlinelink');
 578
 579 Regrettably, your chances of seeing any sort of semantic markup
 580 principles really being followed with actual HTML are pretty thin.
 581
 582 =over
 583
 584 Footnote:
 585 In fact, your chances of finding a page that is simply free of HTML
 586 errors are even thinner.  And surprisingly, sites like Amazon or Yahoo
 587 are typically worse as far as quality of code than personal sites
 588 whose entire production cycle involves simply being saved and uploaded
 589 from Netscape Composer.
 590
 591 =back
 592
 593 The code may be sort of "accidentally semantic", however -- for example,
 594 in a set of pages I was scanning recently, I found that looking for
 595 "td" elements with a "width" attribute value of "375" got me exactly
 596 what I wanted.  No-one designing that page ever conceived of
 597 "width=375" as I<meaning> "this is a headline", but if you impute it
 598 to mean that, it works.
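
Selecting on such an "accidentally semantic" attribute takes nothing more than a second key/value criterion. A minimal sketch, assuming a made-up layout table (the markup, widths, and cell text are hypothetical):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical layout table; the markup and widths are made up.
my $tree = HTML::TreeBuilder->new;
$tree->parse(<<'END_HTML');
<table><tr>
<td width="120">Site navigation</td>
<td width="375">Elvis seen in tortilla</td>
</tr></table>
END_HTML
$tree->eof;

# Two key/value criteria: the tag name and an ordinary HTML attribute.
my @cells = $tree->look_down('_tag', 'td', 'width', '375');
my @texts = map { $_->as_text } @cells;

print "$_\n" for @texts;

$tree->delete;
```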
 599
 600 An approach like this happens to work for the Yahoo News code, because
 601 the headline-links are distinguished by the fact that they (and they
 602 alone) contain a "b" element:
 603
 604   <a href="...long_news_url..."><b>Elvis seen in tortilla</b></a>
 605
 606 or, diagrammed as a part of the parse tree:
 607
 608   . a  [href="...long_news_url..."]
 609     . b
 610       . "Elvis seen in tortilla"
 611
 612 A rule that matches these can be formalized as "look for any 'a'
 613 element that has only one daughter node, which must be a 'b' element".
 614 And this is what it looks like when cooked up as a C<look_down>
 615 expression and prefaced with a bit of code that retrieves the text of
 616 the given Yahoo News page and feeds it to TreeBuilder:
 617
 618   use strict;
 619   use HTML::TreeBuilder 2.97;
 620   use LWP::UserAgent;
 621   sub get_headlines {
 622     my $url = $_[0] || die "What URL?";
 623
 624     my $response = LWP::UserAgent->new->request(
 625       HTTP::Request->new( GET => $url )
 626     );
 627     unless($response->is_success) {
 628       warn "Couldn't get $url: ", $response->status_line, "\n";
 629       return;
 630     }
 631
 632     my $tree = HTML::TreeBuilder->new();
 633     $tree->parse($response->content);
 634     $tree->eof;
 635
 636     my @out;
 637     foreach my $link (
 638       $tree->look_down(   # !
 639         '_tag', 'a',
 640         sub {
 641           return unless $_[0]->attr('href');
 642           my @c = $_[0]->content_list;
 643           @c == 1 and ref $c[0] and $c[0]->tag eq 'b';
 644         }
 645       )
 646     ) {
 647       push @out, [ $link->attr('href'), $link->as_text ];
 648     }
 649
 650     warn "Odd, fewer than 6 stories in $url!" if @out < 6;
 651     $tree->delete;
 652     return @out;
 653   }
 654
 655 ...and add a bit of code to actually call that routine and display the
 656 results...
 657
 658   foreach my $section (qw[tc sc hl wl en]) {
 659     my @links = get_headlines(
 660       "http://dailynews.yahoo.com/h/$section/"
 661     );
 662     print
 663       $section, ": ", scalar(@links), " stories\n",
 664       map(("  ", $_->[0], " : ", $_->[1], "\n"), @links),
 665       "\n";
 666   }
 667
 668 And we've got our own headline-extractor service!  This in and of
 669 itself isn't so amazingly useful (since if you want to see the
 670 headlines, you I<can> just look at the Yahoo News pages), but it could
 671 easily be the basis for quite useful features like filtering the
 672 headlines for certain keywords of interest to you.
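
A keyword filter of that kind could look something like the following sketch. The list in C<@links> is a hypothetical stand-in for what C<get_headlines> returns (a list of C<[URL, headline text]> pairs), and the URLs and keywords are made up:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the list that get_headlines() returns:
# each item is [ URL, headline text ].
my @links = (
  [ 'http://example.com/1', 'Elvis seen in tortilla' ],
  [ 'http://example.com/2', 'New Perl release announced' ],
  [ 'http://example.com/3', 'Linux kernel update ships' ],
);

# Keywords of interest, combined into one case-insensitive pattern.
my @keywords = qw(perl linux);
my $pattern  = join '|', map { quotemeta($_) } @keywords;
my $wanted   = qr/\b(?:$pattern)\b/i;

# Keep only the headlines whose text mentions a keyword.
my @matches = grep { $_->[1] =~ $wanted } @links;

print "  ", $_->[0], " : ", $_->[1], "\n" for @matches;
```

Since the filter works on the plain list of pairs, it stays the same however the criteria passed to C<look_down> have to change.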
 673
 674 Now, one of these days, Yahoo News will decide to change its HTML
 675 template.  When this happens, this will appear to the above program as
 676 there being no links that meet the given criteria; or, less likely,
 677 dozens of erroneous links will meet the criteria.  In either case, the
 678 criteria will have to be changed for the new template; they may just
 679 need adjustment, or you may need to scrap them and start over.
 680
 681 =head2 I<Regardez, duvet!>
 682
 683 It's often quite a challenge to write criteria to match the desired
 684 parts of an HTML parse tree.  Very often you I<can> pull it off with a
 685 simple C<$tree-E<gt>look_down('_tag', 'h1')>, but sometimes you do
 686 have to keep adding and refining criteria, until you might end up with
 687 complex filters like what I've shown in this article.  The
 688 benefit to learning how to deal with HTML parse trees is that one main
 689 search tool, the C<look_down> method, can do most of the work, making
 690 simple things easy, while still making hard things possible.
 691
 692 B<[end body of article]>
 693
 694 =head2 [Author Credit]
 695
 696 Sean M. Burke (C<sburke@cpan.org>) is the current maintainer of
 697 C<HTML::TreeBuilder> and C<HTML::Element>, both originally by
 698 Gisle Aas.
 699
 700 Sean adds: "I'd like to thank the folks who listened to me ramble
 701 incessantly about HTML::TreeBuilder and HTML::Element at this year's Yet
 702 Another Perl Conference and O'Reilly Open Source Software Convention."
 703
 704 =head1 BACK
 705
 706 Return to the L<HTML::Tree|HTML::Tree> docs.
 707
 708 =cut
 709