7 Text::Diff - Perform diffs on files and record sets
13 ## Mix and match filenames, strings, file handles, producer subs,
14 ## or arrays of records; returns diff in a string.
15 ## WARNING: can return B<large> diffs for large files.
16 my $diff = diff "file1.txt", "file2.txt", { STYLE => "Context" };
17 my $diff = diff \$string1, \$string2, \%options;
18 my $diff = diff \*FH1, \*FH2;
19 my $diff = diff \&reader1, \&reader2;
20 my $diff = diff \@records1, \@records2;
22 ## May also mix input types:
23 my $diff = diff \@records1, "file_B.txt";
27 C<diff()> provides a basic set of services akin to the GNU C<diff> utility. It
28 is not anywhere near as feature complete as GNU C<diff>, but it is better
29 integrated with Perl and available on all platforms. It is often faster than
30 shelling out to a system's C<diff> executable for small files, and generally
31 slower on larger files.
33 Relies on L<Algorithm::Diff> for, well, the algorithm. This may not produce
34 the same exact diff as a system's local C<diff> executable, but it will be a
35 valid diff and comprehensible by C<patch>. We haven't seen any differences
36 between Algorithm::Diff's logic and GNU diff's, but we have not examined them
37 to make sure they are indeed identical.
39 B<Note>: If you don't want to import the C<diff> function, do one of the
46 That's a pretty rare occurence, so C<diff()> is exported by default.
51 @ISA = qw( Exporter );
56 use Algorithm::Diff qw( traverse_sequences );
58 ## Hunks are made of ops. An op is the starting index for each
59 ## sequence and the opcode:
60 use constant A => 0; # Array index before match/discard
62 use constant OPCODE => 2; # "-", " ", "+"
63 use constant FLAG => 3; # What to display if not OPCODE "!"
68 diff() takes two parameters from which to draw input and a set of
69 options to control it's output. The options are:
73 =item FILENAME_A, MTIME_A, FILENAME_B, MTIME_B
75 The name of the file and the modification time "files"
77 These are filled in automatically for each file when diff() is passed a
78 filename, unless a defined value is passed in.
80 If a filename is not passed in and FILENAME_A and FILENAME_B are not provided
81 or C<undef>, the header will not be printed.
83 Unused on C<OldStyle> diffs.
85 =item OFFSET_A, OFFSET_B
87 The index of the first line / element. These default to 1 for all
88 parameter types except ARRAY references, for which the default is 0. This
89 is because ARRAY references are presumed to be data structures, while the
90 others are line oriented text.
94 "Unified", "Context", "OldStyle", or an object or class reference for a class
95 providing C<file_header()>, C<hunk_header()>, C<hunk()>, C<hunk_footer()> and
96 C<file_footer()> methods. The two footer() methods are provided for
97 overloading only; none of the formats provide them.
99 Defaults to "Unified" (unlike standard C<diff>, but Unified is what's most
100 often used in submitting patches and is the most human readable of the three.
102 If the package indicated by the STYLE has no hunk() method, c<diff()> will
103 load it automatically (lazy loading). Since all such packages should inherit
104 from Text::Diff::Base, this should be marvy.
106 Styles may be specified as class names (C<STYLE => "Foo"), in which case they
107 will be C<new()>ed with no parameters, or as objects (C<STYLE => Foo->new>).
111 How many lines before and after each diff to display. Ignored on old-style
112 diffs. Defaults to 3.
116 Examples and their equivalent subroutines:
118 OUTPUT => \*FOOHANDLE, # like: sub { print FOOHANDLE shift() }
119 OUTPUT => \$output, # like: sub { $output .= shift }
120 OUTPUT => \@output, # like: sub { push @output, shift }
121 OUTPUT => sub { $output .= shift },
123 If no C<OUTPUT> is supplied, returns the diffs in a string. If
124 C<OUTPUT> is a C<CODE> ref, it will be called once with the (optional)
125 file header, and once for each hunk body with the text to emit. If
126 C<OUTPUT> is an L<IO::Handle>, output will be emitted to that handle.
128 =item FILENAME_PREFIX_A, FILENAME_PREFIX_B
130 The string to print before the filename in the header. Unused on C<OldStyle>
131 diffs. Defaults are C<"---">, C<"+++"> for Unified and C<"***">, C<"+++"> for
134 =item KEYGEN, KEYGEN_ARGS
136 These are passed to L<Algorithm::Diff/traverse_sequences>.
140 B<Note>: if neither C<FILENAME_> option is defined, the header will not be
141 printed. If at one is present, the other and both MTIME_ options must be
142 present or "Use of undefined variable" warnings will be generated (except
143 on C<OldStyle> diffs, which ignores these options).
147 my %internal_styles = (
151 Table => undef, ## "internal", but in another module
155 my @seqs = ( shift, shift );
156 my $options = shift || {};
162 while ( $type eq "CODE" ) {
163 $seqs[$i] = $seq = $seq->( $options );
167 my $AorB = !$i ? "A" : "B";
169 if ( $type eq "ARRAY" ) {
170 ## This is most efficient :)
171 $options->{"OFFSET_$AorB"} = 0
172 unless defined $options->{"OFFSET_$AorB"};
174 elsif ( $type eq "SCALAR" ) {
175 $seqs[$i] = [split( /^/m, $$seq )];
176 $options->{"OFFSET_$AorB"} = 1
177 unless defined $options->{"OFFSET_$AorB"};
180 $options->{"OFFSET_$AorB"} = 1
181 unless defined $options->{"OFFSET_$AorB"};
182 $options->{"FILENAME_$AorB"} = $seq
183 unless defined $options->{"FILENAME_$AorB"};
184 $options->{"MTIME_$AorB"} = (stat($seq))[9]
185 unless defined $options->{"MTIME_$AorB"};
188 open F, "<$seq" or carp "$!: $seq";
193 elsif ( $type eq "GLOB" || UNIVERSAL::isa( $seq, "IO::Handle" ) ) {
194 $options->{"OFFSET_$AorB"} = 1
195 unless defined $options->{"OFFSET_$AorB"};
197 $seqs[$i] = [<$seq>];
200 confess "Can't handle input of type ", ref;
206 my $output_handler = $options->{OUTPUT};
207 my $type = ref $output_handler ;
208 if ( ! defined $output_handler ) {
210 $output_handler = sub { $output .= shift };
212 elsif ( $type eq "CODE" ) {
213 ## No problems, mate.
215 elsif ( $type eq "SCALAR" ) {
216 my $out_ref = $output_handler;
217 $output_handler = sub { $$out_ref .= shift };
219 elsif ( $type eq "ARRAY" ) {
220 my $out_ref = $output_handler;
221 $output_handler = sub { push @$out_ref, shift };
223 elsif ( $type eq "GLOB" || UNIVERSAL::isa $output_handler, "IO::Handle" ) {
224 my $output_handle = $output_handler;
225 $output_handler = sub { print $output_handle shift };
228 croak "Unrecognized output type: $type";
231 my $style = $options->{STYLE};
232 $style = "Unified" unless defined $options->{STYLE};
233 $style = "Text::Diff::$style" if exists $internal_styles{$style};
235 if ( ! $style->can( "hunk" ) ) {
236 eval "require $style; 1" or die $@;
240 if ! ref $style && $style->can( "new" );
242 my $ctx_lines = $options->{CONTEXT};
243 $ctx_lines = 3 unless defined $ctx_lines;
244 $ctx_lines = 0 if $style->isa( "Text::Diff::OldStyle" );
246 my @keygen_args = $options->{KEYGEN_ARGS}
247 ? @{$options->{KEYGEN_ARGS}}
251 my $diffs = 0; ## Number of discards this hunk
252 my $ctx = 0; ## Number of " " (ctx_lines) ops pushed after last diff.
253 my @ops; ## ops (" ", +, -) in this hunk
254 my $hunks = 0; ## Number of hunks
257 $output_handler->( $style->file_header( @seqs, $options ) )
259 $output_handler->( $style->hunk_header( @seqs, @_, $options ) );
260 $output_handler->( $style->hunk ( @seqs, @_, $options ) );
261 $output_handler->( $style->hunk_footer( @seqs, @_, $options ) );
264 ## We keep 2*ctx_lines so that if a diff occurs
265 ## at 2*ctx_lines we continue to grow the hunk instead
266 ## of emitting diffs and context as we go. We
267 ## need to know the total length of both of the two
268 ## subsequences so the line count can be printed in the
270 my $dis_a = sub {push @ops, [@_[0,1],"-"]; ++$diffs ; $ctx = 0 };
271 my $dis_b = sub {push @ops, [@_[0,1],"+"]; ++$diffs ; $ctx = 0 };
277 push @ops, [@_[0,1]," "];
279 if ( $diffs && ++$ctx > $ctx_lines * 2 ) {
280 $emit_ops->( [ splice @ops, 0, $#ops - $ctx_lines ] );
284 ## throw away context lines that aren't needed any more
285 shift @ops if ! $diffs && @ops > $ctx_lines;
290 $options->{KEYGEN}, # pass in user arguments for key gen function
295 $#ops -= $ctx - $ctx_lines if $ctx > $ctx_lines;
296 $emit_ops->( \@ops );
299 $output_handler->( $style->file_footer( @seqs, $options ) ) if $hunks;
301 return defined $output ? $output : $hunks;
307 my ( $p1, $fn1, $t1, $p2, $fn2, $t2 ) = @{$h}{
316 ## remember to change Text::Diff::Table if this logic is tweaked.
317 return "" unless defined $fn1 && defined $fn2;
320 $p1, " ", $fn1, defined $t1 ? "\t" . localtime $t1 : (), "\n",
321 $p2, " ", $fn2, defined $t2 ? "\t" . localtime $t2 : (), "\n",
325 ## _range encapsulates the building of, well, ranges. Turns out there are
328 my ( $ops, $a_or_b, $format ) = @_;
330 my $start = $ops->[ 0]->[$a_or_b];
331 my $after = $ops->[-1]->[$a_or_b];
333 ## The sequence indexes in the lines are from *before* the OPCODE is
334 ## executed, so we bump the last index up unless the OP indicates
337 unless $ops->[-1]->[OPCODE] eq ( $a_or_b == A ? "+" : "-" );
339 ## convert from 0..n index to 1..(n+1) line number. The unless modifier
340 ## handles diffs with no context, where only one file is affected. In this
341 ## case $start == $after indicates an empty range, and the $start must
342 ## not be incremented.
343 my $empty_range = $start == $after;
344 ++$start unless $empty_range;
348 ? $format eq "unified" && $empty_range
351 : $format eq "unified"
352 ? "$start,".($after-$start+1)
358 my ( $seqs, $op, $a_or_b, $op_prefixes ) = @_;
360 my $opcode = $op->[OPCODE];
361 return () unless defined $op_prefixes->{$opcode};
363 my $op_sym = defined $op->[FLAG] ? $op->[FLAG] : $opcode;
364 $op_sym = $op_prefixes->{$op_sym};
365 return () unless defined $op_sym;
367 $a_or_b = $op->[OPCODE] ne "+" ? 0 : 1 unless defined $a_or_b;
368 return ( $op_sym, $seqs->[$a_or_b][$op->[$a_or_b]] );
372 =head1 Formatting Classes
374 These functions implement the output formats. They are grouped in to classes
375 so diff() can use class names to call the correct set of output routines and so
376 that you may inherit from them easily. There are no constructors or instance
377 methods for these classes, though subclasses may provide them if need be.
379 Each class has file_header(), hunk_header(), hunk(), and footer() methods
380 identical to those documented in the Text::Diff::Unified section. header() is
381 called before the hunk() is first called, footer() afterwards. The default
382 footer function is an empty method provided for overloading:
384 sub footer { return "End of patch\n" }
386 Some output formats are provided by external modules (which are loaded
387 automatically), such as L<Text::Diff::Table>. These are
388 are documented here to keep the documentation simple.
392 =head2 Text::Diff::Base
394 Returns "" for all methods (other than C<new()>).
399 package Text::Diff::Base;
402 return bless { @_ }, ref $proto || $proto;
405 sub file_header { return "" }
406 sub hunk_header { return "" }
407 sub hunk { return "" }
408 sub hunk_footer { return "" }
409 sub file_footer { return "" }
413 =head2 Text::Diff::Unified
415 --- A Mon Nov 12 23:49:30 2001
416 +++ B Mon Nov 12 23:49:30 2001
438 $s = Text::Diff::Unified->file_header( $options );
440 Returns a string containing a unified header. The sole parameter is the
441 options hash passed in to diff(), containing at least:
450 FILENAME_PREFIX_A => "---",
451 FILENAME_PREFIX_B => "+++",
453 to override the default prefixes (default values shown).
457 @Text::Diff::Unified::ISA = qw( Text::Diff::Base );
459 sub Text::Diff::Unified::file_header {
460 shift; ## No instance data
464 { FILENAME_PREFIX_A => "---", FILENAME_PREFIX_B => "+++", %$options }
470 Text::Diff::Unified->hunk_header( \@ops, $options );
472 Returns a string containing the output of one hunk of unified diff.
476 sub Text::Diff::Unified::hunk_header {
477 shift; ## No instance data
478 pop; ## Ignore options
483 _range( $ops, A, "unified" ),
485 _range( $ops, B, "unified" ),
491 =item Text::Diff::Unified::hunk
493 Text::Diff::Unified->hunk( \@seq_a, \@seq_b, \@ops, $options );
495 Returns a string containing the output of one hunk of unified diff.
499 sub Text::Diff::Unified::hunk {
500 shift; ## No instance data
501 pop; ## Ignore options
504 my $prefixes = { "+" => "+", " " => " ", "-" => "-" };
506 return join "", map _op_to_line( \@_, $_, undef, $prefixes ), @$ops
512 =head2 Text::Diff::Table
514 +--+----------------------------------+--+------------------------------+
515 | |../Test-Differences-0.2/MANIFEST | |../Test-Differences/MANIFEST |
516 | |Thu Dec 13 15:38:49 2001 | |Sat Dec 15 02:09:44 2001 |
517 +--+----------------------------------+--+------------------------------+
519 | 1|Differences.pm | 2|Differences.pm |
520 | 2|MANIFEST | 3|MANIFEST |
521 | | * 4|MANIFEST.SKIP *
522 | 3|Makefile.PL | 5|Makefile.PL |
523 | | * 6|t/00escape.t *
524 | 4|t/00flatten.t | 7|t/00flatten.t |
525 | 5|t/01text_vs_data.t | 8|t/01text_vs_data.t |
526 | 6|t/10test.t | 9|t/10test.t |
527 +--+----------------------------------+--+------------------------------+
529 This format also goes to some pains to highlight "invisible" characters on
530 differing elements by selectively escaping whitespace:
532 +--+--------------------------+--------------------------+
533 | |demo_ws_A.txt |demo_ws_B.txt |
534 | |Fri Dec 21 08:36:32 2001 |Fri Dec 21 08:36:50 2001 |
535 +--+--------------------------+--------------------------+
536 | 1|identical |identical |
537 * 2| spaced in | also spaced in *
538 * 3|embedded space |embedded tab *
539 | 4|identical |identical |
540 * 5| spaced in |\ttabbed in *
541 * 6|trailing spaces\s\s\n |trailing tabs\t\t\n *
542 | 7|identical |identical |
543 * 8|lf line\n |crlf line\r\n *
544 * 9|embedded ws |embedded\tws *
545 +--+--------------------------+--------------------------+
547 See L</Text::Diff::Table> for more details, including how the whitespace
550 =head2 Text::Diff::Context
552 *** A Mon Nov 12 23:49:30 2001
553 --- B Mon Nov 12 23:49:30 2001
584 Note: hunk_header() returns only "***************\n".
589 @Text::Diff::Context::ISA = qw( Text::Diff::Base );
591 sub Text::Diff::Context::file_header {
592 _header { FILENAME_PREFIX_A=>"***", FILENAME_PREFIX_B=>"---", %{$_[-1]} };
596 sub Text::Diff::Context::hunk_header {
597 return "***************\n";
600 sub Text::Diff::Context::hunk {
601 shift; ## No instance data
602 pop; ## Ignore options
604 ## Leave the sequences in @_[0,1]
606 my $a_range = _range( $ops, A, "" );
607 my $b_range = _range( $ops, B, "" );
609 ## Sigh. Gotta make sure that differences that aren't adds/deletions
610 ## get prefixed with "!", and that the old opcodes are removed.
612 for ( my $start = 0; $start <= $#$ops ; $start = $after ) {
613 ## Scan until next difference
615 my $opcode = $ops->[$start]->[OPCODE];
616 next if $opcode eq " ";
619 while ( $after <= $#$ops && $ops->[$after]->[OPCODE] ne " " ) {
620 $bang_it ||= $ops->[$after]->[OPCODE] ne $opcode;
625 for my $i ( $start..($after-1) ) {
626 $ops->[$i]->[FLAG] = "!";
631 my $b_prefixes = { "+" => "+ ", " " => " ", "-" => undef, "!" => "! " };
632 my $a_prefixes = { "+" => undef, " " => " ", "-" => "- ", "!" => "! " };
635 "*** ", $a_range, " ****\n",
636 map( _op_to_line( \@_, $_, A, $a_prefixes ), @$ops ),
637 "--- ", $b_range, " ----\n",
638 map( _op_to_line( \@_, $_, B, $b_prefixes ), @$ops ),
641 =head2 Text::Diff::OldStyle
652 Note: no file_header().
656 @Text::Diff::OldStyle::ISA = qw( Text::Diff::Base );
660 my $op = $ops->[0]->[OPCODE];
661 $op = "c" if grep $_->[OPCODE] ne $op, @$ops;
662 $op = "a" if $op eq "+";
663 $op = "d" if $op eq "-";
667 sub Text::Diff::OldStyle::hunk_header {
668 shift; ## No instance data
669 pop; ## ignore options
674 return join "", _range( $ops, A, "" ), $op, _range( $ops, B, "" ), "\n";
677 sub Text::Diff::OldStyle::hunk {
678 shift; ## No instance data
679 pop; ## ignore options
681 ## Leave the sequences in @_[0,1]
683 my $a_prefixes = { "+" => undef, " " => undef, "-" => "< " };
684 my $b_prefixes = { "+" => "> ", " " => undef, "-" => undef };
689 map( _op_to_line( \@_, $_, A, $a_prefixes ), @$ops ),
690 $op eq "c" ? "---\n" : (),
691 map( _op_to_line( \@_, $_, B, $b_prefixes ), @$ops ),
697 Must suck both input files entirely in to memory and store them with a normal
698 amount of Perlish overhead (one array location) per record. This is implied by
699 the implementation of Algorithm::Diff, which takes two arrays. If
700 Algorithm::Diff ever offers an incremental mode, this can be changed (contact
701 the maintainers of Algorithm::Diff and Text::Diff if you need this; it
702 shouldn't be too terribly hard to tie arrays in this fashion).
704 Does not provide most of the more refined GNU diff options: recursive directory
705 tree scanning, ignoring blank lines / whitespace, etc., etc. These can all be
706 added as time permits and need arises, many are rather easy; patches quite
709 Uses closures internally, this may lead to leaks on C<perl> versions 5.6.1 and
710 prior if used many times over a process' life time.
714 Barrie Slaymaker <barries@slaysys.com>.
716 =head1 COPYRIGHT & LICENSE
718 Copyright 2001, Barrie Slaymaker. All Rights Reserved.
720 You may use this under the terms of either the Artistic License or GNU Public
721 License v 2.0 or greater.