1 2007-01-12 Gisle Aas <gisle@ActiveState.com>
5 Cloning of parser state for compatibility with threads.
6 Fixed by Bo Lindbergh <blgl@hagernas.com>.
8 Don't require whitespace between declaration tokens.
9 <http://rt.cpan.org/Ticket/Display.html?id=20864>
13 2006-07-10 Gisle Aas <gisle@ActiveState.com>
17 Treat <> at the end of document as text. Used to be
18 reported as a comment.
20 Improved Firefox compatibility for bad HTML:
21 - Unclosed <script>, <style> are now treated as empty tags.
22 - Unclosed <textarea>, <xmp> and <plaintext> treat rest as text.
23 - Unclosed <title> closes at next tag.
25 Make <!a'b> a comment by itself.
29 2006-04-28 Gisle Aas <gisle@ActiveState.com>
33 Yaakov Belch discovered yet another issue with <script> parsing.
34 Enabling of 'empty_element_tags' got the parser confused
35 if it found such a tag for elements that are normally parsed
36 in literal mode. Of these <script src="..."/> is the only
37 one likely to be found in documents.
38 <http://rt.cpan.org//Ticket/Display.html?id=18965>
42 2006-04-27 Gisle Aas <gisle@ActiveState.com>
46 When ignore_element was enabled it got confused if the
47 corresponding tags did not nest properly; the end tag
48 was treated as if it were a start tag.
49 Found and fixed by Yaakov Belch <code@yaakovnet.net>.
50 <http://rt.cpan.org/Ticket/Display.html?id=18936>
54 2006-04-26 Gisle Aas <gisle@ActiveState.com>
58 Make sure the 'start_document' fires exactly once for
59 each document parsed. For earlier releases it did not
60 fire at all for empty documents and could fire multiple
61 times if parse was called with empty chunks.
63 Documentation tweaks and typo fixes.
67 2006-03-22 Gisle Aas <gisle@ActiveState.com>
71 Named entities outside the Latin-1 range are now only expanded
72 when properly terminated with ";". This makes HTML::Parser
73 compatible with Firefox/Konqueror/MSIE when it comes to how these
74 entities are expanded in attribute values. Firefox does expand
75 unterminated non-Latin-1 entities in plain text, so here
76 HTML::Parser only stays compatible with Konqueror/MSIE.
77 Fixes <http://rt.cpan.org/Ticket/Display.html?id=17962>.
79 Fixed some documentation typos spotted by <william@knowmad.com>.
80 <http://rt.cpan.org/Ticket/Display.html?id=18062>
84 2006-02-14 Gisle Aas <gisle@ActiveState.com>
88 The 3.49 release didn't compile with VC++ because it mixed code
89 and declarations. Fixed by Steve Hay <steve.hay@uk.radan.com>.
93 2006-02-08 Gisle Aas <gisle@ActiveState.com>
97 Events could sometimes still fire after a handler had signaled eof.
99 Marked_sections with text ending in square bracket parsed wrong.
100 Fix provided by <paul.bijnens@xplanation.com>.
101 <http://rt.cpan.org/Ticket/Display.html?id=16749>
105 2005-12-02 Gisle Aas <gisle@ActiveState.com>
109 Enabling empty_element_tags by default for HTML::TokeParser
110 was a mistake. Reverted that change.
111 <http://rt.cpan.org/Ticket/Display.html?id=16164>
113 When processing a document with "marked_sections => 1", the
114 skipped text missed the first 3 bytes "<![".
115 <http://rt.cpan.org/Ticket/Display.html?id=16207>
119 2005-11-22 Gisle Aas <gisle@ActiveState.com>
123 Added empty_element_tags and xml_pic configuration
124 options. These make it possible to enable these XML
125 features without enabling the full XML-mode.
127 The empty_element_tags is enabled by default for
132 2005-10-24 Gisle Aas <gisle@ActiveState.com>
136 Don't try to treat an literal as space.
137 This breaks Unicode parsing.
138 <http://rt.cpan.org/Ticket/Display.html?id=15068>
140 The unbroken_text option is now on by default
141 for HTML::TokeParser.
143 HTML::Entities::encode will now encode "'" by default.
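A minimal sketch of the HTML::Entities change above, assuming the exported encode_entities() function is called with its default character set. The sample string is made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical sample input. With no second argument the default set of
# "unsafe" characters is used, which now includes the apostrophe.
my $encoded = encode_entities(q{it's <b>});
print "$encoded\n";
```

Callers that need the apostrophe left alone can pass an explicit set of characters to encode as the second argument, e.g. encode_entities($str, '<>&"').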
145 Improved report/ignore_tags documentation by
146 Norbert Kiesel <nkiesel@tbdnetworks.com>.
148 Test suite now uses Test::More, by
149 Norbert Kiesel <nkiesel@tbdnetworks.com>.
151 Fix HTML::Entities typo spotted by
152 Stefan Funke <bundy@adm.arcor.net>.
154 Faster load time with XSLoader (perl-5.6 or better now required).
156 Fixed POD markup errors in some of the modules.
160 2005-01-06 Gisle Aas <gisle@ActiveState.com>
164 Fix stack memory leak caused by missing PUTBACK. Only
165 code that used $p->parse(\&cb) form was affected.
166 Fix provided by Gurusamy Sarathy <gsar@sophos.com>.
170 2004-12-28 Gisle Aas <gisle@ActiveState.com>
174 Fix confusion about nested quotes in <script> and <style> text.
178 2004-12-06 Gisle Aas <gisle@ActiveState.com>
182 The SvUTF8 flag was not propagated correctly when replacing
183 unterminated entities.
185 Fixed test failure because of missing binmode on Windows.
189 2004-12-04 Gisle Aas <gisle@ActiveState.com>
193 Avoid sv_catpvn_utf8_upgrade() as that macro was not
194 available in perl-5.8.0.
195 Patch by Reed Russell <Russell.Reed@acxiom.com>.
197 Add casts to suppress compilation warnings for char/U8
200 HTML::HeadParser will always push new header values.
201 This makes sure we never lose old header values.
205 2004-11-30 Gisle Aas <gisle@ActiveState.com>
209 Fix unresolved symbol error with perl-5.005.
213 2004-11-29 Gisle Aas <gisle@ActiveState.com>
217 Make utf8_mode only available on perl-5.8 or better. It produced
218 garbage with older versions of perl.
220 Emit warning if entities are decoded and something in the first
221 chunk looks like hibit UTF-8. Previously this warning was only
222 triggered for documents with BOM.
226 2004-11-23 Gisle Aas <gisle@ActiveState.com>
230 More documentation of the Unicode issues. Moved around HTML::Parser
233 New boolean option; $p->utf8_mode to allow parsing of raw UTF-8.
235 Documented that HTML::Entities::decode_entities() can take multiple
238 Unterminated entities are now decoded in text (compatibility
239 with MSIE misfeature).
241 Document HTML::Entities::_decode_entities(); this variation of the
242 decode_entities() function has been available for a long time, but
243 has not been documented until now.
245 HTML::Entities::_decode_entities() can now be told to try to
246 expand unterminated entities.
248 Simplified Makefile.PL
252 2004-11-23 Gisle Aas <gisle@ActiveState.com>
256 The HTML::HeadParser will skip Unicode BOM. Previously it
257 would consider the <head> section done when it saw the BOM.
259 The parser will look for Unicode BOM and give appropriate
260 warnings if the form found indicates trouble.
262 If no matching end tag is found for <script>, <style>, <xmp>
263 <title>, <textarea> then generate one where the next tag
266 For <script> and <style> recognize quoted strings and don't
267 consider the element ended if the corresponding end tag is found
268 inside such a string.
272 2004-11-17 Gisle Aas <gisle@ActiveState.com>
276 The <title> element is now parsed in literal mode, which
277 means that other tags are not recognized until </title> has
280 Unicode support for perl-5.8 and better.
282 Decoding Unicode entities always enabled; no longer a compile
285 Propagation of UTF8 state on strings.
286 Patch contributed by John Gardiner Myers <jgmyers@proofpoint.com>.
288 Calculate offsets and lengths in chars for Unicode strings.
290 Fixed link typo in the HTML::TokeParser documentation.
294 2004-11-11 Gisle Aas <gisle@ActiveState.com>
298 New boolean option; $p->closing_plaintext
299 Contributed by Alex Kapranoff <alex@kapranoff.ru>
303 2004-11-10 Gisle Aas <gisle@ActiveState.com>
307 Improved handling of HTML encoded surrogate pairs and illegally
308 encoded Unicode; <http://rt.cpan.org/Ticket/Display.html?id=7785>.
309 Patch by John Gardiner Myers <jgmyers@proofpoint.com>.
311 Avoid generating bad UTF8 strings when decoding entities
312 representing chars beyond #255 in 8-bit strings. Such bad
313 UTF8 sometimes made perl-5.8.5 and older segfault.
315 Undocument v2 style subclassing in synopsis section.
319 Make 'gcc -Wall' happier.
321 Avoid modification of PVs during parsing of attrspec.
322 Another patch by John Gardiner Myers.
326 2004-04-01 Gisle Aas <gisle@ActiveState.com>
330 Improved MSIE/Mozilla compatibility. If the same attribute
331 name repeats for a start tag, use the first value instead
332 of the last. Patch by Nick Duffek <html-parser@duffek.com>.
333 <http://rt.cpan.org/Ticket/Display.html?id=5472>
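The repeated-attribute rule above can be observed with a small sketch like the following. The markup and the %seen hash are invented for the example; it assumes the version 3 handler API with the 'tagname,attr' argspec.

```perl
use strict;
use warnings;
use HTML::Parser ();

my %seen;    # hypothetical store: tag name => href value reported for it

my $p = HTML::Parser->new(
    api_version => 3,
    start_h     => [
        sub {
            my ($tagname, $attr) = @_;
            $seen{$tagname} = $attr->{href};
        },
        'tagname,attr',
    ],
);

# The same attribute name appears twice in one start tag.
$p->parse('<a href="first.html" href="second.html">link</a>');
$p->eof;

print "href = $seen{a}\n";
```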
337 2003-12-12 Gisle Aas <gisle@ActiveState.com>
341 Documentation fixes by Paul Croome <Paul.Croome@softwareag.com>.
343 Removed redundant dSP.
347 2003-10-27 Gisle Aas <gisle@ActiveState.com>
351 Fix segfault that happened when the parse callback caused
352 the stack to get reallocated. The original bug report was
353 <http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=217616>
357 2003-10-14 Gisle Aas <gisle@ActiveState.com>
361 Perl 5.005 or better is now required. For some reason we get
362 a test failure with perl-5.004 and I don't really feel like
363 debugging that perl any more. Details about this failure can
364 be found at <http://rt.cpan.org/Ticket/Display.html?id=4065>.
366 New HTML::TokeParser method called 'get_phrase'. It returns
367 all current text while ignoring any phrase-level markup.
369 The HTML::TokeParser method 'get_text' now expands skipped
370 non-phrase-level tags as a single space.
374 2003-10-10 Gisle Aas <gisle@ActiveState.com>
378 If the document parsed ended with some kind of unterminated markup,
379 then the parser state was not reset properly and this piece of markup
380 would show up in the beginning of the next document parsed.
381 <http://rt.cpan.org/Ticket/Display.html?id=3954>
383 The get_text and get_trimmed_text methods of HTML::TokeParser can
384 now take multiple end tags as argument. Patch by <siegmann@tinbergen.nl>
385 at <http://rt.cpan.org/Ticket/Display.html?id=3166>.
387 Various documentation tweaks.
389 Included another example program: hdump
393 2003-08-19 Gisle Aas <gisle@ActiveState.com>
397 The -DDEBUGGING fix in 3.30 was not really there :-(
401 2003-08-17 Gisle Aas <gisle@ActiveState.com>
405 The previous release failed to compile on a -DDEBUGGING perl
406 like the one provided by Redhat 9.
408 Got rid of references to perl-5.7.
410 Further fixes to avoid warnings from Visual C.
411 Patch by Steve Hay <steve.hay@uk.radan.com>.
415 2003-08-14 Gisle Aas <gisle@ActiveState.com>
419 Setting xml_mode now implies strict_names also for end tags.
421 Avoid warning from Visual C. Patch by <gsar@activestate.com>.
423 64-bit fix from Doug Larrick <doug@ties.org>
424 http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=195500
426 Try to parse similar to Mozilla/MSIE in certain edge cases.
427 All these are outside of the official definition of HTML but
428 HTML spam often tries to take advantage of these.
430 - New configuration attribute 'strict_end'. Unless enabled
431 we will allow end tags to contain extra words or stuff
432 that looks like attributes before the '>'. This means that
435 </foo foo="<ignored>">
439 are now all parsed as a 'foo' end tag instead of text.
440 Even if the extra stuff looks like attributes they will not
441 be reported if requested via the 'attr' or 'tokens' argspecs
442 for the 'end' handler.
444 - Parse '</:comment>' and '</ comment>' as comments unless
445 strict_comment is enabled. Previous versions of the parser
446 would report these as text. If these comments contain
447 quoted words prefixed by space or '=' these words can
448 contain '>' without terminating the comment.
450 - Parse '<! "<>" foo>' as comment containing ' "<>" foo'.
451 Previous versions of the parser would terminate the comment
452 at the first '>' and report the rest as text.
454 - Legacy comment mode: Parse with comments terminated with a
455 lone '>' if no '-->' is found before eof.
457 - Incomplete tag at eof is reported as a 'comment' instead
458 of 'text' unless strict_comment is enabled.
462 2003-04-16 Gisle Aas <gisle@ActiveState.com>
466 When 'strict_comment' is off (which it is by default)
467 treat anything that matches <!...> as a comment.
469 Should now be more efficient on threaded perls.
473 2003-01-18 Gisle Aas <gisle@ActiveState.com>
477 Typo fixes to the documentation.
479 HTML::Entities::escape_entities_numeric contributed
480 by Sean M. Burke <sburke@cpan.org>.
482 Included one more example program 'hlc' that shows
483 how to downcase all tags in an HTML file.
487 2002-03-17 Gisle Aas <gisle@ActiveState.com>
491 Avoid core dump in some cases where the callback croaks.
492 The perl_call_method and perl_call_sv need the G_EVAL flag
495 New parser attributes; 'attr_encoded' and 'case_sensitive'.
496 Contributed by Guy Albertelli II <guy@albertelli.com>.
499 - don't encode \r by default as suggested by Sean M. Burke.
502 - ignore empty http-equiv
503 - allow multiple <link> elements. Patch by
504 Timur I. Bakeyev <timur@gnu.org>
506 Avoid warnings from bleadperl on the uentities test.
510 2001-05-11 Gisle Aas <gisle@ActiveState.com>
514 Minor tweaks for build failures on perl5.004_04, perl-5.6.0,
515 and for macro clash under Windows.
517 Improved parsing of <plaintext>... :-)
521 2001-05-09 Gisle Aas <gisle@ActiveState.com>
527 New events: start_document, end_document
529 New argspecs: skipped_text, offset_end
531 The offset/line/column counters were not properly reset
536 2001-05-01 Gisle Aas <gisle@ActiveState.com>
540 The $p->ignore_elements filter did not work as it should if
541 handlers for start/end events were not registered.
545 2001-04-17 Gisle Aas <gisle@ActiveState.com>
549 The <textarea> element is now parsed in literal mode, i.e. no other tags
550 recognized until the </textarea> tag is seen. Unlike other literal elements,
551 the text content is not 'cdata'.
553 The XML &apos; entity is decoded. The apos-char itself is still encoded as
554 &#39; as &apos; is not really an HTML entity, and not recognized by many HTML
559 2001-04-10 Gisle Aas <gisle@ActiveState.com>
563 Fix a memory leak which occurred when using filter methods.
565 Avoid a few compiler warnings (DEC C):
566 - Trailing comma found in enumerator list
567 - "unsigned char" is not compatible with "const char".
573 2001-04-02 Gisle Aas <gisle@ActiveState.com>
577 Some minor documentation updates.
581 2001-03-30 Gisle Aas <gisle@ActiveState.com>
585 Implemented 'tag', 'line', 'column' argspecs.
587 HTML::PullParser doc update.
588 eg/hform is an example of HTML::PullParser usage.
592 2001-03-27 Gisle Aas <gisle@ActiveState.com>
596 Shorten 'report_only_tags' to 'report_tags'.
597 I think it reads better.
599 Bleadperl portability fixes.
603 2001-03-25 Gisle Aas <gisle@ActiveState.com>
607 HTML::HeadParser made more efficient by using 'ignore_elements'.
609 HTML::LinkExtor made more efficient by using 'report_only_tags'.
611 HTML::TokeParser generalized into HTML::PullParser. HTML::PullParser
612 only supports the get_token/unget_token interface of HTML::TokeParser,
613 but is more flexible because the information that makes up a token
614 is customisable. HTML::TokeParser is made into an HTML::PullParser
619 2001-03-19 Gisle Aas <gisle@ActiveState.com>
623 Array references can be passed to the filter methods. Makes it easier
624 to use them as constructor options.
626 Example programs updated to use filters.
628 Reset ignored_element state on EOF.
630 Documentation updates.
632 The netscape_buggy_comment() method now generates a mandatory warning
633 about its deprecation.
637 2001-03-13 Gisle Aas <gisle@ActiveState.com>
641 This is a developer-only release. It contains some new
642 experimental features. The interface to these might still change.
644 Implemented filters to reduce the numbers of callbacks generated:
646 - $p->report_only_tags()
647 - $p->ignore_elements()
649 New @attr argspec. Less overhead than 'attr' and allows
650 compatibility with XML::Parser style start events.
652 The whole argspec can be wrapped up in @{...} to signal
653 flattening. Only makes a difference when the target is an
658 2001-03-09 Gisle Aas <gisle@ActiveState.com>
662 Avoid the entity2char global. That should make the module
663 more thread safe. Patch by Gurusamy Sarathy <gsar@ActiveState.com>.
667 2001-02-24 Gisle Aas <gisle@ActiveState.com>
671 There was a C++ style comment left in util.c. Strict C
672 compilers do not like that kind of stuff.
676 2001-02-23 Gisle Aas <gisle@ActiveState.com>
680 The 3.16 release broke MULTIPLICITY builds. Fixed.
684 2001-02-22 Gisle Aas <gisle@ActiveState.com>
688 The unbroken_text option now works across ignored tags.
690 Fix casting of pointers on some 64 bit platforms.
692 Fix decoding of Unicode entities. Only optionally available for
693 perl-5.7.0 or better.
695 Expose internal decode_entities() function at the Perl level.
697 Reindented some code.
701 2000-12-26 Gisle Aas <gisle@ActiveState.com>
705 HTML::TokeParser's get_tag() method now takes multiple
706 tags to match. Hopefully the documentation is also a bit clearer.
708 #define PERL_NO_GET_CONTEXT: Should speed up things for thread
709 enabled versions of perl.
711 Quote some more entities that also happen to be perl keywords.
712 This avoids warnings on perl-5.004.
714 Unicode entities only triggered for perl-5.7.0 or higher.
718 2000-12-03 Gisle Aas <gisle@ActiveState.com>
722 If a handler triggered by flushing text at eof called the
723 eof method then infinite recursion occurred. Fixed.
724 Bug discovered by Jonathan Stowe <gellyfish@gellyfish.com>.
726 Allow <!doctype ...> to be parsed as declaration.
730 2000-09-17 Gisle Aas <gisle@ActiveState.com>
734 Experimental support for decoding of Unicode entities.
738 2000-09-14 Gisle Aas <gisle@ActiveState.com>
742 Some tweaks to get it to compile with "Optimierender Microsoft (R)
743 32-Bit C/C++-Compiler, Version 12.00.8168, fuer x86."
744 Patch by Matthias Waldorf <matthias.waldorf@zoom.de>.
746 HTML::Entities documentation spelling patch by
747 David Dyck <dcd@tc.fluke.com>.
751 2000-08-22 Gisle Aas <gisle@ActiveState.com>
755 HTML::LinkExtor and eg/hrefsub now obtain %linkElements from
756 the HTML::Tagset module.
760 2000-06-29 Gisle Aas <gisle@ActiveState.com>
764 Avoid core dump when stack gets relocated as the result of
765 text handler invocation while $p->unbroken_text is enabled.
766 Needed to refresh the stack pointer.
770 2000-06-28 Gisle Aas <gisle@ActiveState.com>
774 Avoid core dump if somebody clobbers the aliased $self argument of
777 HTML::TokeParser documentation update suggested by
778 Paul Makepeace <Paul.Makepeace@realprogrammers.com>.
782 2000-05-23 Gisle Aas <gisle@ActiveState.com>
786 Fix core dump for large start tags.
787 Bug spotted by Alexander Fraser <green795@hotmail.com>
789 Added yet another example program: eg/hanchors
791 Typo fix by Jamie McCarthy <jamie@mccarthy.org>
795 2000-03-20 Gisle Aas <gisle@aas.no>
799 Fix perl5.004 builds (was broken in 3.06)
801 Declaration parsing mode now only triggers for <!DOCTYPE ...> and
802 <!ENTITY ...>. Based on patch by la mouton <kero@3sheep.com>.
806 2000-03-06 Gisle Aas <gisle@aas.no>
810 Multi-threading/MULTIPLICITY compilation fix.
811 Both Doug MacEachern <dougm@pobox.com> and
812 Matthias Urlichs <smurf@noris.net> provided a patch.
814 Avoid some "statement not reached" warnings from picky
817 Remove final commas in enums as ANSI C does not allow
818 them and some compilers actually care.
819 Patch by James Walden <jamesw@ichips.intel.com>
821 Added eg/htextsub example program.
825 2000-01-22 Gisle Aas <gisle@aas.no>
829 Implemented $p->unbroken_text option
831 Don't parse content of certain HTML elements as CDATA when
834 Offset was reported with wrong sign for text at end of chunk.
838 2000-01-15 Gisle Aas <gisle@aas.no>
842 Backed out 3.03-patch that checked for legal handler and attribute
843 names in the HTML::Parser constructor.
845 Documentation typo fixed by Michael.
849 2000-01-14 Gisle Aas <gisle@aas.no>
853 We did not get out of comment mode for comments ending with an
854 odd number of "-" before ">". Patch by la mouton <kero@3sheep.com>
856 Documentation patch by Michael.
860 1999-12-21 Gisle Aas <gisle@aas.no>
864 Hide ~-magic IV-pointer to 'struct p_state' behind a reference.
865 This allows copying of the internal _hparser_xs_state element, and
866 will make HTML-Tree-0.61 work again.
868 Introduced $p->init() which might be useful for subclasses that
869 only want the initialization part of the constructor.
871 Filled out DIAGNOSTICS section of the HTML::Parser POD.
875 1999-12-19 Gisle Aas <gisle@aas.no>
879 Rely on ~-magic instead of a DESTROY method to deallocate
880 the internal 'struct p_state'. This avoids memory leaks
881 when people simply wipe out the content of the object hash.
883 One of the assertions in hparser.c had opposite logic. This made
884 the parser fail when compiled with a -DDEBUGGING perl.
886 Don't assume any specific order of hash keys in the t/cases.t.
887 This test failed with some newer development releases of perl.
891 1999-12-14 Gisle Aas <gisle@aas.no>
895 Documentation update (most of it from Michael)
897 Minor patch to eg/hstrip so that it uses a "" handler
900 Test suite patches from Michael
904 1999-12-13 Gisle Aas <gisle@aas.no>
908 Patches from Michael:
910 - A handler of "" means that the event will be ignored.
911 More efficient than using 'sub {}' as handler.
913 - Don't use a perl hash for looking up argspec keywords.
915 - Documentation tweaks.
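A sketch of the "" handler mentioned in the list above, assuming the version 3 constructor options default_h and comment_h. The input markup and the $out variable are made up for the example.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $out = '';    # hypothetical accumulator for everything that is kept

my $p = HTML::Parser->new(
    api_version => 3,
    # Copy the original text of every event that has no handler of its own.
    default_h => [
        sub {
            my $text = shift;
            $out .= $text if defined $text;
        },
        'text',
    ],
    # An empty string as handler: comment events are ignored outright.
    comment_h => [''],
);

$p->parse('<p>keep<!-- drop --></p>');
$p->eof;

print "$out\n";
```

Because the comment event is claimed by the "" handler, it never falls through to the default handler, so the comment is stripped from the copied output.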
919 1999-12-09 Gisle Aas <gisle@aas.no>
921 Release 2.99_95 (this is a 3.00 candidate)
923 Fixed core dump when "<" was followed by an 8-bit character.
924 Spotted and test case provided by Doug MacEachern. Doug had
925 been running HTML-Parser-XS through more than 1 million URLs that
926 had been downloaded via LWP.
928 Handlers can now invoke $p->eof to request the parsing to terminate.
929 HTML::HeadParser has been simplified by taking advantage of this.
930 Also added a title-extraction example that uses this.
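The following is a rough sketch of title extraction that stops early by calling $p->eof from a handler. It is not the example program shipped with the distribution; the handler layout, variable names and sample document are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::Parser ();

my $title = '';
my $p = HTML::Parser->new(api_version => 3);

$p->handler(
    start => sub {
        my ($tagname, $self) = @_;
        return unless $tagname eq 'title';

        # Inside <title>: collect decoded text ...
        $self->handler(text => sub { $title .= shift }, 'dtext');

        # ... and ask the parser to stop once </title> is seen.
        $self->handler(
            end => sub {
                my ($end_tagname, $parser) = @_;
                $parser->eof if $end_tagname eq 'title';
            },
            'tagname,self',
        );
    },
    'tagname,self',
);

$p->parse('<html><head><title>Example page</title></head>'
        . '<body><p>never reached</p></body></html>');
$p->eof;

print "$title\n";
```

Stopping at </title> means the rest of the document is never tokenized, which matters when only the header of a large page is of interest.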
932 Michael once again fixed my bad English in the HTML::Parser
935 netscape_buggy_comment will carp instead of warn
939 Documented that HTML::Filter is deprecated.
941 Made backslash reserved in literal argspec strings.
943 Added several new test scripts.
947 1999-12-08 Gisle Aas <gisle@aas.no>
949 Release 2.99_94 (should almost be a 3.00 candidate)
951 Renamed 'cdata_flag' as 'is_cdata'.
953 Dropped support for wrapping callback handler and argspec
954 in an array and passing a reference to $p->handler. It
955 created ambiguities when you want to pass an array as
956 handler destination and not update argspec. The wrapping
957 for constructor arguments is unchanged.
959 Reworked the documentation after updates from Michael.
961 Simplified internal check_handler(). It should probably simply
962 be inlined in handler() again.
964 Added argspec 'length' and 'undef'
966 Fix statement-less label. Fix suggested by Matthew Langford
967 <langfml@Eng.Auburn.EDU>.
969 Added two more example programs: eg/hstrip and eg/htext.
971 Various minor patches from Michael.
975 1999-12-07 Gisle Aas <gisle@aas.no>
981 $p->bool_attr_value renamed as $p->boolean_attribute_value
983 Internal renaming: attrspec --> argspec
985 Introduced internal 'enum argcode' in hparser.c
991 1999-12-05 Gisle Aas <gisle@aas.no>
995 More documentation patches from Michael
997 Renamed 'token1' as 'token0' as suggested by Michael
999 For artificial end tags we now report 'tokens', but not 'tokenpos'.
1001 Boolean attribute values show up as (0, 0) in 'tokenpos' now.
1003 If $p->bool_attr_value is set it will influence 'tokens'
1005 Fix for core dump when parsing <a "> when $p->strict_names(0).
1006 Based on fix by Michael.
1008 Will av_extend() the tokens/tokenspos arrays.
1010 New test suite script by Michael: t/attrspec.t
1014 1999-12-04 Gisle Aas <gisle@aas.no>
1018 Implemented attrspec 'offset'
1020 Documentation patch from Michael
1022 Some more cleanup/updated TODO
1026 1999-12-03 Gisle Aas <gisle@aas.no>
1028 Release 2.99_90 (first beta for 3.00)
1030 Using "realloc" as a parameter name in grow_tokens created
1031 problems for some people. Fix by Paul Schinder <schinder@pobox.com>
1033 Patch by Michael that makes array handler destinations really work.
1035 Patch by Michael that makes HTML::TokeParser use this. This gave
1036 a speedup of about 80%.
1038 Patch by Michael that makes t/cases into a real test.
1040 Small HTML::Parser documentation patch by Michael.
1042 Renamed attrspec 'origtext' to 'text' and 'decoded_text' to 'dtext'
1044 Split up Parser.xs. Moved stuff into hparser.c and util.c
1046 Dropped html_ prefix from internal parser functions.
1048 Renamed internal function html_handle() as report_event().
1052 1999-12-02 Gisle Aas <gisle@aas.no>
1056 HTML::Parser documentation patch from Michael.
1058 Fix memory leaks in html_handler()
1060 Patch that makes an array legal as handler destination.
1063 The end of marked sections does not eat successive newline
1066 The artificial end event for empty tag in xml_mode did not
1067 report an empty origtext.
1069 New constructor option: 'api_version'
1073 1999-12-01 Gisle Aas <gisle@aas.no>
1077 Support "event" in argspec. It expands to the name of the
1078 handler (minus "default").
1080 Fix core dump for large start tags. The tokens_grow() routine
1081 needed an adjustment. Added test for this; t/largstags.t.
1085 1999-11-30 Gisle Aas <gisle@aas.no>
1089 Major restructuring/simplification of callback interface based on
1090 initial work by Michael. The main news is that you now need to
1091 tell what arguments you want to be provided to your callbacks.
1093 The following parser options have been eliminated:
1095 $p->decode_text_entities
1103 1999-11-26 Gisle Aas <gisle@aas.no>
1107 Documentation update by Michael A. Chase.
1109 Fix for declaration parsing by Michael A. Chase.
1111 Workaround for perl5.004_05 bug. Can't return &PL_sv_undef.
1115 1999-11-22 Gisle Aas <gisle@aas.no>
1119 New Parser.pm POD based on initial work by Michael A. Chase.
1120 All new features should now be described.
1122 $p->callback(start => undef) will not reset the callback.
1124 $p->xml_mode() did not parse attributes correctly because
1125 HCTYPE_NOT_SPACE_EQ_SLASH_GT flag was never set.
1131 1999-11-18 Gisle Aas <gisle@aas.no>
1135 Implemented $p->attr_pos attribute. This causes attr positions
1136 within $origtext of the start tag to be reported instead of the
1137 attribute values. The positions are reported as 4 numbers; end of
1138 previous attr, start of this attr, start of attr value, and end of
1139 attr. This should make substr() manipulations of $origtext easy.
1141 Implemented $p->unbroken_text attribute. This makes sure that
1142 text segments are never broken and given back as separate text
1143 callbacks. It delays text callbacks until some other markup
1144 has been recognized.
1146 More English corrections by Michael A. Chase.
1148 HTML::LinkExtor now recognizes even more URI attributes as
1149 suggested by Sean M. Burke <sburke@netadventure.net>
1151 Completed marked sections support. It is also now a compile
1152 time decision if you want this supported or not. The only
1153 drawback of enabling it should be a possible parsing speed
1154 reduction. I have not measured this yet.
1156 The keys for callbacks initialized in the constructor are now
1157 suffixed with "_cb".
1159 Renamed $p->pass_cbdata to $p->pass_self.
1161 Added magic number to the p_state struct.
1165 1999-11-17 Gisle Aas <gisle@aas.no>
1169 Don't leak $@ modifications from HTML::Parser constructor.
1171 Included HTML::Parser POD.
1173 Marked sections almost work. CDATA and RCDATA should work.
1175 For tags that take us into literal_mode; <script>, <style>,
1176 <xmp>, we did not recognize the end tag unless it was written
1181 1999-11-16 Gisle Aas <gisle@aas.no>
1185 The mkhctype and mkpfunc scripts were using \z inside RE. This
1186 did not work for perl5.004. Replaced them with plain old
1191 1999-11-15 Gisle Aas <gisle@aas.no>
1195 Grammar fixes by Michael A. Chase <mchase@ix.netcom.com>
1197 Some more test suite patches for Win32 by Michael A. Chase
1198 <mchase@ix.netcom.com>
1200 Implemented $p->strict_names attribute. By default we now
1201 allow almost anything in tag and attribute names. This is much
1202 closer to the behaviour of some popular browsers. This allows us
1203 to parse broken tags like this example from the LWP mailing list:
1204 <IMG ALIGN=MIDDLE SRC=newprevlstGr.gif ALT=[PREV LIST] BORDER=0>
1206 Introduced some tables in "hctype.h" and "pfunc.h". These
1207 are built by the corresponding "mk..." script.
1211 1999-11-10 Gisle Aas <gisle@aas.no>
1215 Make Parser.xs compile on perl5.004_05 too.
1217 New callback called 'default'. This will be called for any
1218 document text no other callback shows an interest in.
1220 Patch by Michael A. Chase <mchase@ix.netcom.com> that should
1221 help clean up files for the test suite on Win32.
1223 Can now set up various attributes with key/value pairs passed to
1226 $p->parse_file() will open the file in binmode()
1228 Pass complete processing instruction tag as second argument
1229 to process callback.
1231 New boolean attribute v2_compat. This influences how attributes
1232 are reported for start tags.
1234 HTML::Filter now filters process instructions too.
1236 Faster HTML::LinkExtor by taking advantage of the new
1237 callback interface. The module now also uses URI.pm (instead
1238 of the old URI::URL) to make URIs absolute.
1240 Faster HTML::TokeParser by taking advantage of new
1245 1999-11-09 Gisle Aas <gisle@aas.no>
1249 Entities in attribute values are now always expanded.
1251 If you set the $p->decode_text_entities to a true value, then
1252 you don't have to decode the text yourself.
1254 In xml_mode we don't report empty element tags as a start tag
1255 with an extra parameter any more. Instead we generate an artificial
1258 'xml_mode' now implies 'keep_case'.
1260 The parser now keeps its own copy of the bool_attr_value value.
1262 Avoid memory leak for text callbacks
1264 Avoid using ERROR as a goto label.
1266 Introduced common internal accessor function for all boolean parser
1269 Tweaks to make Parser.xs compile under perl5.004.
1273 1999-11-08 Gisle Aas <gisle@aas.no>
1277 Internal fast decode_entities(). By using it we are able to make
1278 the HTML::Entities::decode function 6 times faster than the old one
1279 implemented in pure Perl.
1281 $p->bool_attr_value() can be set to influence the value that
1282 boolean attributes will be assigned. The default is to assign
1283 a value identical to the attribute name.
1285 Process instructions are reported as "PI" in @accum
1287 $p->xml_mode(1) modifies how processing instructions are terminated
1288 and allows "/>" at the end of start tags.
1290 Turn off optimizations when compiling with gcc on Solaris. Avoids
1291 what we believe to be a compiler bug. Should probably figure out
1292 which versions of gcc have this bug.
1296 1999-11-05 Gisle Aas <gisle@aas.no>
1300 The previous release did not even compile. I forgot to try 'make test'
1305 1999-11-05 Gisle Aas <gisle@aas.no>
1309 Generalized <XMP>-support to cover all literal parsing. Currently
1310 activated for <script>, <style>, <xmp> and <plaintext>.
1314 1999-11-05 Gisle Aas <gisle@aas.no>
1320 Allow ":" in tag and attribute names
1322 Include rest of the HTML::* files from the old HTML::Parser
1323 package. This should make testing easier.
1327 1999-11-04 Gisle Aas <gisle@aas.no>
1331 Implemented keep_case() option. If this attribute is true, then
1332 we don't lowercase tag and attribute names.
1334 Implemented accum() that takes an array reference. Tokens are
1335 pushed onto this array instead of sent to callbacks.
1337 Implemented strict_comment().
1341 1999-11-03 Gisle Aas <gisle@aas.no>
1345 Baseline of XS implementation
1349 1999-11-05 Gisle Aas <gisle@aas.no>
1353 Allow ":" in attribute names as a workaround for Microsoft Excel
1354 2000 which generates such files.
1356 Emit a deprecation warning if the netscape_buggy_comment() method is
1357 used. The method is used in strict_comment().
1359 Avoid duplication of parse_file() method in HTML::HeadParser.
1363 1999-10-29 Gisle Aas <gisle@aas.no>
1367 $p->parse_file() will not close a handle passed to it any more.
1368 If passed a filename that can't be opened it will return undef
1369 instead of raising an exception, and strings like "*STDIN" are not
1370 treated as globs any more.
1372 HTML::LinkExtor knows about the background attribute of <table>.
1373 Patch by Clinton Wong <clintdw@netcom.com>
1375 HTML::TokeParser will parse large inline strings much faster now.
1376 The string holding the document must not be changed during parsing.
1380 1999-06-09 Gisle Aas <gisle@aas.no>
1384 Documentation updates.
1388 1998-12-18 Gisle Aas <aas@sn.no>
1392 Protect HTML::HeadParser from evil $SIG{__DIE__} hooks.
1396 1998-11-13 Gisle Aas <aas@sn.no>
1400 HTML::TokeParser can now parse strings directly and does the
1401 right thing if you pass it a GLOB. Based on patch by
1402 Sami Itkonen <si@iki.fi>.
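A small sketch of parsing a string directly with HTML::TokeParser, assuming the current interface in which the document is passed as a reference to a scalar. The markup and the @links array are made up for the example.

```perl
use strict;
use warnings;
use HTML::TokeParser ();

my $html = '<p>See <a href="http://example.com/">the example</a>.</p>';

# A reference to a scalar tells the constructor to parse the string itself
# rather than treat the argument as a file name.
my $tp = HTML::TokeParser->new(\$html)
    or die "Cannot create parser";

my @links;    # hypothetical result list of [href, link text] pairs
while (my $tag = $tp->get_tag('a')) {
    my $href = $tag->[1]{href};                # attribute hash is element 1
    my $text = $tp->get_trimmed_text('/a');    # text up to the end tag
    push @links, [$href, $text];
}

print "$_->[0] => $_->[1]\n" for @links;
```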
1404 HTML::Parser now allows space before and after "--" in Netscape
1405 comments. Patch by Peter Orbaek <poe@daimi.au.dk>.
1409 1998-07-08 Gisle Aas <aas@sn.no>
1413 Added HTML::TokeParser. Check it out!
1417 1998-07-07 Gisle Aas <aas@sn.no>
1421 Don't end a text chunk with space when we try to avoid breaking up
1426 1998-06-22 Gisle Aas <aas@sn.no>
1430 HTML::HeadParser->parse_file will now stop parsing when the
1431 <body> starts as it should.
1433 HTML::LinkExtor more easily subclassable by introducing the
1434 $self->_found_link method.
1438 1998-04-28 Gisle Aas <aas@sn.no>
1442 Never split words (a sequence of non-space) between two invocations
1443 of $self->text. This is just a simplification of the code that tried
1444 not to break entities.
1446 HTML::Parser->parse_file now uses smaller chunks as already
1447 suggested by the HTML::Parser documentation.
1451 1998-04-02 Gisle Aas <aas@sn.no>
1455 The HTML::Parser could sometimes break hex entities (like )
1458 Removed remaining forced dependencies on libwww-perl modules. It
1459 means that all tests should now pass, even if libwww-perl was not
1460 installed previously.
1466 1998-04-01 Gisle Aas <aas@sn.no>
1468 Release 2.14, HTML::* modules unbundled from libwww-perl-5.22.