Intelligent Sorting
Ryan Frantz
Perlers,

I'm working on a script that will generate a listing of files on a
regular basis so that I can create hyperlinks to each respective file.
As you see from the sorted output below, though it is in ASCIIbetical
order, it is not in chronological order:

/2005Jul01-2005Jul02/foo/bar.html
/2005Jul05-2005Jul06/foo/bar.html
/2005Jun09-2005Jun10/foo/bar.html
/2005Jun10-2005Jun11/foo/bar.html
/2005Jun13-2005Jun14/foo/bar.html
/2005Jun14-2005Jun15/foo/bar.html

Is there any decent documentation available that I could study so that I
can sort this better?

I thought about grabbing the ctime of each file and sorting on that but
I'm not sure if that would add unnecessary complexity to the script.

Incidentally, these files are created by SARG (Squid log analyzer) and
I've found nothing in the config that lets me customize the naming
convention for the directories.

ry

Jeff 'japhy' Pinyan
On Jul 6, Ryan Frantz said:

QUOTE
[snip]

The primary problem is that the dates in the filenames are formatted as
"YYYYmmmDD" rather than "YYYYMMDD". Before sorting the filenames, you
could convert the month NAMES to numerical representations (Jan => 01, Dec
=> 12), and then after you've sorted them (ASCIIbetically will work here)
you can change those numerical representations back to the month names.
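A minimal sketch of that convert, sort, convert-back idea. The helper names (%name2num, %num2name, @paths) are made up for illustration, and it assumes the only 8-digit runs in a path are the converted dates:

```perl
use strict;
use warnings;

# Month abbreviations in calendar order; position + 1 is the month number.
my @names    = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
my %name2num = map { $names[$_] => sprintf('%02d', $_ + 1) } 0 .. $#names;
my %num2name = reverse %name2num;

# Hypothetical input, in the format shown in the original post.
my @paths = qw(
    /2005Jul01-2005Jul02/foo/bar.html
    /2005Jun14-2005Jun15/foo/bar.html
    /2005Jun09-2005Jun10/foo/bar.html
);

# 1. Rewrite YYYYmmmDD as YYYYMMDD so a plain string sort is chronological.
my @numeric = map {
    (my $p = $_) =~ s/(\d{4})([A-Za-z]{3})(\d{2})/$1$name2num{$2}$3/g;
    $p;
} @paths;

# 2. Sort ASCIIbetically, then 3. put the month names back.
my @sorted = map {
    (my $p = $_) =~ s/(\d{4})(\d{2})(\d{2})/$1$num2name{$2}$3/g;
    $p;
} sort @numeric;

print "$_\n" for @sorted;
```

The sort itself needs no comparison block here, because the rewritten strings already compare in date order.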

--
Jeff "japhy" Pinyan % How can we ever be the sold short or
RPI Acacia Brother #734 % the cheated, we who for every service
http://japhy.perlmonk.org/ % have long ago been overpaid?
http://www.perlmonks.org/ % -- Meister Eckhart

Ryan Frantz wrote:
QUOTE
[snip]

Here is one way to approach:

#!perl

use strict;
use warnings;

# month abbreviation (lower case) => month number
my %AlphaToNbr = qw(jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7 aug 8 sep 9 oct 10 nov 11 dec 12);
foreach my $MySortedFile (sort { $a->[1] <=> $b->[1] or
                                 $AlphaToNbr{lc($a->[2])} <=> $AlphaToNbr{lc($b->[2])} or
                                 $a->[3] <=> $b->[3]
                               }
                          map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
                          <DATA> ) {
    chomp($MySortedFile->[0]);
    print $MySortedFile->[0] . "\n";
}

__DATA__
/2005Jul01-2005Jul02/foo/bar.html
/2005Jul05-2005Jul06/foo/bar.html
/2005Jun09-2005Jun10/foo/bar.html
/2005Jun10-2005Jun11/foo/bar.html
/2005Jun13-2005Jun14/foo/bar.html
/2005Jun14-2005Jun15/foo/bar.html

Output:
/2005Jun09-2005Jun10/foo/bar.html
/2005Jun10-2005Jun11/foo/bar.html
/2005Jun13-2005Jun14/foo/bar.html
/2005Jun14-2005Jun15/foo/bar.html
/2005Jul01-2005Jul02/foo/bar.html
/2005Jul05-2005Jul06/foo/bar.html

Wags ;)


*******************************************************
This message contains information that is confidential
and proprietary to FedEx Freight or its affiliates.
It is intended only for the recipient named and for
the express purpose(s) described therein.
Any other use is prohibited.
*******************************************************

Ryan Frantz
QUOTE
Is there any decent documentation available that I could study so
that I can sort this better?


As soon as I hit Send on this email I checked my 'Learning Perl' book
and found some information (Ch. 15, not that far yet ;^)). Prior to
implementing anything, I want to understand what's going on.

QUOTE
[snip]

Thanks for the snippet; I'll dig into this to learn more!

ry

Jeff 'japhy' Pinyan
On Jul 7, Ryan Frantz said:

QUOTE
foreach my $MySortedFile (sort { $a->[1] <=> $b->[1]
or
$AlphaToNbr{lc($a->[2])} <=> $AlphaToNbr{lc($b->[2])} or
$a->[3] <=> $b->[3]
}
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
<DATA> ) {
[snip]
}

1. Create a hash with numeric equivalents for the months.
2. Perform a sort by first comparing numbers (I'm assuming the YYYY, but
I don't quite know how that reference works).
3. Then comparing mmm (having been converted to lower case, the value of
the respective AlphaToNbr key is compared).
4. Compare another pair of numbers (DD?)

After that I'm lost.  I'm not familiar with the 'map' function or what
happens after that.

You have to read it from the bottom up. FIRST the input filehandle is
read (in this case, DATA), and all the lines of input are fed to the map()
function. THEN the map() function returns a list of array references,
whose elements are: the original line, $1, $2, $3 (from the regex match).
This list of array references is then passed to sort(), which sorts them
first by their year, then by the hash-value associated with the lowercase
version of their month, and then by their day.
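To make the bottom-up reading concrete, here is a sketch of the same pipeline unrolled into separate steps. The intermediate names (@lines, @decorated, @sorted) are invented for illustration, and a small literal list stands in for what would come from the DATA filehandle:

```perl
use strict;
use warnings;

my %AlphaToNbr = qw(jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7 aug 8 sep 9 oct 10 nov 11 dec 12);

# Stand-in for the lines read from <DATA> (newlines still attached).
my @lines = (
    "/2005Jul01-2005Jul02/foo/bar.html\n",
    "/2005Jun10-2005Jun11/foo/bar.html\n",
    "/2005Jun09-2005Jun10/foo/bar.html\n",
);

# Step 1 (the map): pair each line with the year, month and day captured from it.
my @decorated = map { [ $_, /^.(\d{4})(\w{3})(\d{2})/ ] } @lines;
# $decorated[0] is now [ "/2005Jul01-2005Jul02/foo/bar.html\n", '2005', 'Jul', '01' ]

# Step 2 (the sort): compare year, then month number, then day.
my @sorted = sort {
       $a->[1] <=> $b->[1]
    or $AlphaToNbr{ lc $a->[2] } <=> $AlphaToNbr{ lc $b->[2] }
    or $a->[3] <=> $b->[3]
} @decorated;

# Step 3 (the foreach): element 0 of each array reference is the original line.
print $_->[0] for @sorted;
```

Each later comparison only runs when the earlier ones return 0 (equal), which is why chaining them with "or" works.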

QUOTE
my %AlphaToNumber = (
jan => "1",
feb => "2",
mar => "3",
apr => "4",
may => "5",
jun => "6",
jul => "7",
aug => "8",
sep => "9",
"oct" => "10",
nov => "11",
dec => "12",
);

# sort chronologically using a code snippet from 'Wags'
@user_links = ( sort {
$a->[1] <=> $b->[1]
or
$AlphaToNumber{lc($a->[2])} <=> $AlphaToNumber{lc($b->[2])}
or
$a->[3] <=> $b->[3]
} @user_links;
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
<DATA> );

Ok, you would do:

@user_links = sort {
    $a->[1] <=> $b->[1]
    or
    $AlphaToNumber{lc($a->[2])} <=> $AlphaToNumber{lc($b->[2])}
    or
    $a->[3] <=> $b->[3]
} map {
    [ $_, /^.(\d{4})(\w{3})(\d{2})/ ]
} @user_links;

Here, your @user_links array holds the data that Wags was reading from the
DATA filehandle. We still execute the map() on it, though, because we
need to get from "/2005Jun01-Jun04xxx" to the array reference containing
["/2005Jun01-Jun04xxx", 2005, 'Jun', '01'] (the regex captures the first
date in the name).
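Note that after this assignment @user_links holds array references rather than plain strings. One common variation, sketched here as an assumption about what may be wanted rather than as part of the code above, adds a final map that keeps only element 0, so the array ends up holding the original strings in sorted order. The sample paths are made up to match the format in the thread:

```perl
use strict;
use warnings;

my %AlphaToNumber = qw(jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7 aug 8 sep 9 oct 10 nov 11 dec 12);

# Hypothetical contents for @user_links.
my @user_links = qw(
    /2005Jul05-2005Jul06/foo/bar.html
    /2005Jun13-2005Jun14/foo/bar.html
    /2005Jul01-2005Jul02/foo/bar.html
);

# Read from the bottom up: decorate, sort, then unwrap.
@user_links =
    map  { $_->[0] }                                  # keep only the original string
    sort {
           $a->[1] <=> $b->[1]
        or $AlphaToNumber{ lc $a->[2] } <=> $AlphaToNumber{ lc $b->[2] }
        or $a->[3] <=> $b->[3]
    }
    map  { [ $_, /^.(\d{4})(\w{3})(\d{2})/ ] }        # [ string, year, month, day ]
    @user_links;

print "$_\n" for @user_links;
```

With the unwrap step in place, later code can print the elements directly instead of dereferencing with $_->[0].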


Ryan Frantz
QUOTE
Here is one way to approach:
#!perl

use strict;
use warnings;

my %AlphaToNbr = qw(jan 1 feb 2 mar 3 apr 4 may 5 jun 6 jul 7 aug 8
sep 9 oct 10 nov 11 dec 12);
foreach my $MySortedFile (sort { $a->[1] <=> $b->[1]
or
$AlphaToNbr{lc($a->[2])} <=> $AlphaToNbr{lc($b->[2])} or
$a->[3] <=> $b->[3]
}
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
<DATA> ) {
chomp($MySortedFile->[0]);
print $MySortedFile->[0] . "\n";
}


I have a rudimentary understanding of the above now (from reading Ch 15
of 'Learning Perl' and the perlreftut manpage). This is what I grok
(assuming a YYYYmmmDD format):

1. Create a hash with numeric equivalents for the months.
2. Perform a sort by first comparing numbers (I'm assuming the YYYY, but
I don't quite know how that reference works).
3. Then comparing mmm (having been converted to lower case, the value of
the respective AlphaToNbr key is compared).
4. Compare another pair of numbers (DD?)

After that I'm lost. I'm not familiar with the 'map' function or what
happens after that.

Also, you use an imaginary scalar that would contain the data. I have
the data in an array and tried to use your sortsub as follows (taking
some cues from the LP book); I think I'm off base...

----begin snip----
(omitted creation of array @user_links)
my %AlphaToNumber = (
jan => "1",
feb => "2",
mar => "3",
apr => "4",
may => "5",
jun => "6",
jul => "7",
aug => "8",
sep => "9",
"oct" => "10",
nov => "11",
dec => "12",
);

# sort chronologically using a code snippet from 'Wags'
@user_links = ( sort {
$a->[1] <=> $b->[1]
or
$AlphaToNumber{lc($a->[2])} <=> $AlphaToNumber{lc($b->[2])}
or
$a->[3] <=> $b->[3]
} @user_links;
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
<DATA> );
----end snip----



Ryan Frantz wrote:
QUOTE
[snip]


I have a rudimentary understanding of the above now (from reading Ch
15 of 'Learning Perl' and the perlreftut manpage).  This is what I
grok (assuming a YYYYmmmDD format):

1. Create a hash with numeric equivalents for the months.
2. Perform a sort by first comparing numbers (I'm assuming the YYYY,
but I don't quite know how that reference works).
The (\d{4}) means digits, and there MUST be exactly four of them. The general form is {lowest count,highest count}; if you leave the upper bound blank, as in {1,}, it means one or as many as can be found.
3. Then comparing mmm (having been converted to lower case, the value
of the respective AlphaToNbr key is compared).
4. Compare another pair of numbers (DD?)
Correct: if the years are equal and the months are equal, then the days are compared.

After that I'm lost.  I'm not familiar with the 'map' function or what
happens after that.

Also, you use an imaginary scalar that would contain the data.  I have
the data in an array and tried to use your sortsub as follows (taking
some cues from the LP book); I think I'm off base...

----begin snip----
(omitted creation of array @user_links)
my %AlphaToNumber = (
jan => "1",
feb => "2",
mar => "3",
apr => "4",
may => "5",
jun => "6",
jul => "7",
aug => "8",
sep => "9",
"oct" => "10",
nov => "11",
dec => "12",
);

# sort chronologically using a code snippet from 'Wags'
@user_links = ( sort {
$a->[1] <=> $b->[1]
or
$AlphaToNumber{lc($a->[2])} <=> $AlphaToNumber{lc($b->[2])}
or
$a->[3] <=> $b->[3]
} @user_links;
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
<DATA> );

Just take <DATA> and replace that with @user_links. The code should look like:
@user_links = ( sort {
    $a->[1] <=> $b->[1]
    or
    $AlphaToNumber{lc($a->[2])} <=> $AlphaToNumber{lc($b->[2])}
    or
    $a->[3] <=> $b->[3]
}
map {[$_, /^.(\d{4})(\w{3})(\d{2})/]}
@user_links );

The map creates an array reference for each line, where
item 0 is the whole input line
item 1 is the (\d{4}) capture, the year
item 2 is the (\w{3}) capture, the month
item 3 is the (\d{2}) capture, the day
Now when you look at the sort, you see $a->[1] <=> $b->[1], which compares the years; then [2] compares the months numerically (via the hash lookup); then [3] compares the numeric days.
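A small sketch of why the map gets those four items: in list context a successful match returns its captured groups, and a failed match returns an empty list. The variable names and sample strings are made up for illustration:

```perl
use strict;
use warnings;

my $line = '/2005Jun13-2005Jun14/foo/bar.html';

# In list context, a successful match returns the three captures.
my @fields = $line =~ /^.(\d{4})(\w{3})(\d{2})/;
print "@fields\n";             # prints "2005 Jun 13"

# A failed match returns an empty list, so the array reference built by
# the map would contain only the original line for such input.
my @none = 'no-date-here' =~ /^.(\d{4})(\w{3})(\d{2})/;
print scalar(@none), "\n";     # prints "0"

# {1,} means "one or more", so this grabs the whole run of digits.
my ($digits) = 'abc12345' =~ /(\d{1,})/;
print "$digits\n";             # prints "12345"
```

Lines that do not match would sort with undefined year, month and day (and produce warnings under "use warnings"), so it may be worth filtering them out first.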

Wags ;)

Ryan Frantz
QUOTE
[snip]

To finalize the script, I need to output this data. I did the following
and it works.

----begin----
(omitted opening filehandle and other prints...)

foreach (@user_links) {
print USER "$_->[0]";
}
----end----

Many thanks to Wags and japhy; I've really learned a lot from the both
of you. I'm off to pick up 'Programming Perl' after I finish 'Learning
Perl'...

ry

Wiggins d'Anconia
Ryan Frantz wrote:
[snip]
QUOTE

Many thanks to Wags and japhy; I've really learned a lot from the both
of you.  I'm off to pick up 'Programming Perl' after I finish 'Learning
Perl'...

ry


Pick up the Learning Perl Objects, References, and Modules book before
picking up Programming Perl; it will serve you better as a learning
tool. Though they may seem like more advanced topics, learning them will
accelerate your Perl usage and save you loads of time. I am certainly
not saying that you shouldn't have the Camel on your desk, but the
learning series when first starting will be more helpful. Much of what
the camel provides is already available from perldoc. If you can only
get one, stick with the learning series, though if you can swing both
there is no reference source like the Camel.

http://danconia.org

Scott R. Godin
Jeff 'japhy' Pinyan wrote:
QUOTE
[snip]

The primary problem is that the dates in the filenames are formatted as
"YYYYmmmDD" rather than "YYYYMMDD".  Before sorting the filenames, you
could convert the month NAMES to numerical representations (Jan => 01,
Dec => 12), and then after you've sorted them (ASCIIbetically will work
here) you can change those numerical representations back to the month
names.


from Programming Perl's 'Efficiency' chapter:

"Sorting on a manufactured key array may be faster than using a fancy
sort subroutine. A given array value may participate in several sort
comparisons, so if the sort subroutine has to do much recalculation,
it's better to factor out that calculation to a separate pass before the
actual sort." :-)

such as, for example, building the hash keys on the fly as you slurp in
the dir names/paths into that key's value . . . (hint, hint, did the
lightbulb go off yet?)

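# month numbers are quoted strings: a bare 08 or 09 is an invalid octal literal
my %month2num = ( Jan => '01', Feb => '02', Mar => '03', Apr => '04',
                  May => '05', Jun => '06', Jul => '07', Aug => '08',
                  Sep => '09', Oct => '10', Nov => '11', Dec => '12' );
my %squid;
foreach ( qw[ /2005Jun13-2005Jun14/foo/bar.html ] ) {
    my $fullpath = $_;
    my ($y1, $m1, $d1, $y2, $m2, $d2) =
        m(
            ^/      # starting slash
            (\d{4}) # year
            (\w{3}) # monthname
            (\d{2}) # day
            -       # a dash
            (\d{4}) # etc, etc,
            (\w{3})
            (\d{2})
            /
        )x
        or do { warn "$fullpath didn't match!\n"; next };
    my ($MD1, $MD2) = ( $month2num{ $m1 }, $month2num{ $m2 } );
    # the manufactured key sorts chronologically as a plain string
    $squid{"$y1$MD1$d1-$y2$MD2$d2"} = $fullpath;
}
print "$squid{$_}\n" foreach sort keys %squid;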

Of course, there are certain constraints involved with regards to how
much memory you'll use if your list is long... :)

