WorkTheWeb Forums > Text file too large for Perl?


James marks

Mar 6 2005, 09:34 PM

I hope someone can explain this and offer a solution for me,

I've written a Perl script to parse an HTML file that was produced as a
result of exporting database information from FileMaker Pro. The
exported HTML file is 21.5 megs.

I tested the script against a subset of the data in the HTML file and
the script worked fine (approximately 50 records or so). When I run the
script against the entire 21.5 meg HTML file, however, the script
produces only a partial output file then stops.

It seems as though the 21.5 meg file is too much for Perl to handle
(which would surprise me). Does the problem lie in Perl, or is it some
other limitation? How do I parse this file without breaking it into a
hundred little chunks?

I'm running the script on a 1 GHz Mac G4 with nearly 800 megs of RAM,
if that's any help.

I'm parsing the file with a standard "while" loop to walk through each
record and write a new file that I can load into a MySQL database:

(pseudo-code follows)

while (<FILE_IN>) {
    chomp();
    $line_count++;
    if ($line_count == n) {
        m/<td>(.*)<\/td>/;
        my $variable = $1;
        print "$variable in new context\n";
    } elsif ($line_count == o) {
        ...
    }
}

If this isn't going to work, how do I parse a 21.5 meg text file?

Thanks!

James

Charles K. Clarkson

Mar 6 2005, 11:09 PM

James marks <mailto:[Email Removed]> wrote:

: I hope someone can explain this and offer a solution for me,
:
: I've written a Perl script to parse an HTML file that was produced as
: a result of exporting database information from FileMaker Pro. The
: exported HTML file is 21.5 megs.
:
: I tested the script against a subset of the data in the HTML file and
: the script worked fine (approximately 50 records or so). When I run
: the script against the entire 21.5 meg HTML file, however, the script
: produces only a partial output file then stops.

We can point out problems in real code; pseudo-code is pretty useless.

: It seems as though the 21.5 meg file is too much for Perl to handle
: (which would surprise me). Does the problem lie in Perl, or is it
: some other limitation? How do I parse this file without breaking it
: into a hundred little chunks?

The problem is most likely in your algorithm. Show us the code.

: I'm running the script on a 1 GHz Mac G4 with nearly 800 megs of RAM,
: if that's any help.
:
: I'm parsing the file with a standard "while" loop to walk through each
: record and write a new file that I can load into a MySQL database:
:
: (pseudo-code follows)

I'm confused. Why show us pseudo-code if you already have real code
written?

: while (<FILE_IN>) {
:     chomp();
:     $line_count++;
:     if ($line_count == n) {
:         m/<td>(.*)<\/td>/;
:         my $variable = $1;
:         print "$variable in new context\n";
:     } elsif ($line_count == o) {
:         ...
:     }
$line_count will never equal o (the letter). Perhaps 0 (the integer).

Unless you are using a whole bunch of other line-number-specific
tests, it would probably be more efficient to process the first line
OUTSIDE the while block. You also might want to test that the match
succeeded before using $1, unless you are absolutely certain it will
succeed on every line.

# Process first line
my $first_line = <FILE_IN>;
....

# Process rest of lines
while (<FILE_IN>) {
    next unless m/<td>(.*)<\/td>/;
    my $variable = $1;
    print "$variable in new context\n";
}

I assume you are actually printing to a file in your real code.

HTH,

Charles K. Clarkson
--
Mobile Homes Specialist

James marks

Mar 7 2005, 01:52 AM

QUOTE

The problem is most likely in your algorithm. Show us the code.

(Oops. Replied only to Charles by accident. Reposting to the list:)

Sorry. I was posting a part of the real code only to avoid posting an
overly long string of code. The problem, it seemed to me, was more
likely some other limitation than Perl since the code ran fine on files
up to a certain size. However, I'll post the whole code if you'd like:

(The script creates a file that can be loaded into a MySQL database.)

#!/usr/bin/perl

use warnings;
use strict;

my $source_file = "/users/jamesmarks/desktop/published_stories.htm";
my $destination_file = "/users/jamesmarks/desktop/published_stories.tb";
my $line_count = 0;
my $total_line_count = 0;

open FILE_IN, "$source_file" or die "Cannot open source file: $!";
open FILE_OUT, ">$destination_file" or die "Cannot open destination file: $!";

select FILE_OUT;

print <<'DEFINE_TABLE';
USE trib_stories;

DROP TABLE IF EXISTS story;

CREATE TABLE story (
story_id INT AUTO_INCREMENT,
issue_date DATE,
section VARCHAR(10),
byline VARCHAR(25),
staff_writer INT,
headline VARCHAR(50),
subhead VARCHAR(255),
body_copy TEXT,
caption_1 VARCHAR(255),
caption_2 VARCHAR(255),
caption_3 VARCHAR(255),
caption_4 VARCHAR(255),
caption_5 VARCHAR(255),
caption_6 VARCHAR(255),
PRIMARY KEY (story_id),
INDEX index1 (issue_date, section, byline),
FULLTEXT (headline),
FULLTEXT (body_copy)
);

DEFINE_TABLE

while (<FILE_IN>) {
    chomp();
    $line_count++;
    $total_line_count++;
    if ($line_count == 1) {
        print "INSERT INTO story\nSET story_id = NULL,\n";
    } elsif ($line_count == 2) {
        m{(\d\d?)/(\d\d?)/(\d\d\d\d)};
        my $issue_date = "$3-$1-$2";
        if ($issue_date eq " ") {
            $issue_date = "";
        }
        print "issue_date = \"$issue_date\",\n";
    } elsif ($line_count == 3) {
        m{<TD>(.*)</TD>};
        my $section = $1;
        if ($section eq " ") {
            $section = "";
        }
        print "section = \"$section\",\n";
    } elsif ($line_count == 4) {
        m{<TD>(.*)</TD>};
        my $byline = $1;
        if ($byline eq " ") {
            $byline = "";
        }
        print "byline = \"$byline\",\n";
    } elsif ($line_count == 5) {
        m{<TD>(.*)</TD>};
        my $staff_writer = $1;
        if ($staff_writer eq " ") {
            $staff_writer = "";
        }
        print "staff_writer = \"$staff_writer\",\n";
    } elsif ($line_count == 6) {
        s/ //g;
        m{<TD>(.*)</TD>};
        my $headline = $1;
        if ($headline eq " ") {
            $headline = "";
        }
        print "headline = \"$headline\",\n";
    } elsif ($line_count == 7) {
        s/ //g;
        m{<TD>(.*)</TD>};
        my $subhead = $1;
        if ($subhead eq " ") {
            $subhead = "";
        }
        print "subhead = \"$subhead\",\n";
    } elsif ($line_count == 8) {
        m{<TD>(.*)</TD>};
        my $body_copy = $1;
        if ($body_copy eq " ") {
            $body_copy = "";
        }
        print "body_copy = \"$body_copy\",\n";
    } elsif ($line_count == 9) {
        m{<TD>(.*)</TD>};
        my $caption_1 = $1;
        if ($caption_1 eq " ") {
            $caption_1 = "";
        }
        print "caption_1 = \"$caption_1\",\n";
    } elsif ($line_count == 10) {
        m{<TD>(.*)</TD>};
        my $caption_2 = $1;
        if ($caption_2 eq " ") {
            $caption_2 = "";
        }
        print "caption_2 = \"$caption_2\",\n";
    } elsif ($line_count == 11) {
        m{<TD>(.*)</TD>};
        my $caption_3 = $1;
        if ($caption_3 eq " ") {
            $caption_3 = "";
        }
        print "caption_3 = \"$caption_3\",\n";
    } elsif ($line_count == 12) {
        m{<TD>(.*)</TD>};
        my $caption_4 = $1;
        if ($caption_4 eq " ") {
            $caption_4 = "";
        }
        print "caption_4 = \"$caption_4\",\n";
    } elsif ($line_count == 13) {
        m{<TD>(.*)</TD>};
        my $caption_5 = $1;
        if ($caption_5 eq " ") {
            $caption_5 = "";
        }
        print "caption_5 = \"$caption_5\",\n";
    } elsif ($line_count == 14) {
        m{<TD>(.*)</TD>};
        my $caption_6 = $1;
        if ($caption_6 eq " ") {
            $caption_6 = "";
        }
        print "caption_6 = \"$caption_6\";\n\n";
    }
    if ($line_count == 15) {
        $line_count = 0;
    }
}

close FILE_IN;
close FILE_OUT;

James marks

Mar 7 2005, 01:57 AM

QUOTE

The problem is most likely in your algorithm. Show us the code.

It turns out, after some testing, that if I break the 21.5 meg HTML
file into two roughly equal pieces, the script runs with no problem and
produces the results expected. So it appears, to my inexperienced eye,
that the problem is related to the file size.

Does this make sense?

James

John W. Krahn

Mar 7 2005, 03:24 AM

James marks wrote:

QUOTE

There is nothing in your code that I can see that would not work with a large
file (if you consider 21.5 MB large.) :-) Perhaps it is a limitation of the
operating system?

QUOTE

(The script creates a file that can be loaded into a MySQL database.)

#!/usr/bin/perl

use warnings;
use strict;

my $source_file = "/users/jamesmarks/desktop/published_stories.htm";
my $destination_file = "/users/jamesmarks/desktop/published_stories.tb";
my $line_count = 0;
my $total_line_count = 0;

open FILE_IN, "$source_file" or die "Cannot open source file: $!";
open FILE_OUT, ">$destination_file" or die "Cannot open destination file: $!";

select FILE_OUT;

print <<'DEFINE_TABLE';
USE trib_stories;

DROP TABLE IF EXISTS story;

CREATE TABLE story (
story_id INT AUTO_INCREMENT,
issue_date DATE,
section VARCHAR(10),
byline VARCHAR(25),
staff_writer INT,
headline VARCHAR(50),
subhead VARCHAR(255),
body_copy TEXT,
caption_1 VARCHAR(255),
caption_2 VARCHAR(255),
caption_3 VARCHAR(255),
caption_4 VARCHAR(255),
caption_5 VARCHAR(255),
caption_6 VARCHAR(255),
PRIMARY KEY (story_id),
INDEX index1 (issue_date, section, byline),
FULLTEXT (headline),
FULLTEXT (body_copy)
);

DEFINE_TABLE

while (<FILE_IN>) {
chomp();
$line_count++;
$total_line_count++;

You could use the built-in $. variable (since it is there already) instead of
your own $total_line_count and then define $line_count as:

my $line_count = $. % 15;
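
A minimal sketch of that idea, assuming 15-line records as in the script above. The helper name record_positions and the in-memory filehandle are only for illustration; the real script would read from FILE_IN:

```perl
use strict;
use warnings;

# Return, for every input line, its position within a 15-line record,
# derived from Perl's built-in line counter $. (the 15th line maps to 0).
# record_positions is a hypothetical helper, not part of the original script.
sub record_positions {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @positions;
    while (my $line = <$fh>) {
        push @positions, $. % 15;    # $. counts lines read from $fh
    }
    close $fh;
    return @positions;
}

# Lines 14, 15, and 16 of a 31-line input
my @pos = record_positions(join '', map { "line $_\n" } 1 .. 31);
print "@pos[13, 14, 15]\n";
```

Because the 15th line of each record maps to 0 rather than 15, the separate "reset to 0" test at the bottom of the loop is no longer needed.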

QUOTE

if ($line_count == 1) {
print "INSERT INTO story\nSET story_id = NULL,\n";
} elsif ($line_count == 2) {
m{(\d\d?)/(\d\d?)/(\d\d\d\d)};
my $issue_date = "$3-$1-$2";

You should only use the numeric variables if the match succeeded otherwise
they will contain values from the last successful match!
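
To illustrate the point, here is a small sketch with two hypothetical helpers (unchecked_pair and checked_cell are made-up names, not from the script). The first uses $1 without testing the match; the second only returns $1 when the match succeeded:

```perl
use strict;
use warnings;

# Unchecked: $1 is read whether or not the match succeeded. When the
# second match fails, $1 still holds the capture from the first one.
sub unchecked_pair {
    my ($first, $second) = @_;
    my @out;
    $first =~ m{<TD>(.*)</TD>};
    push @out, $1;
    $second =~ m{<TD>(.*)</TD>};
    push @out, $1;    # stale if $second did not match
    return @out;
}

# Checked: return the capture only on a successful match, else undef.
sub checked_cell {
    my ($line) = @_;
    return $line =~ m{<TD>(.*)</TD>} ? $1 : undef;
}

my @fields = unchecked_pair('<TD>Sports</TD>', '<TR>');
print "unchecked: @fields\n";    # second value repeats 'Sports'

my $cell = checked_cell('<TR>');
print defined $cell ? "checked: $cell\n" : "checked: no cell found\n";
```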

QUOTE

if ($issue_date eq " ") {

$issue_date will *never* be equal to ' '.

QUOTE

$issue_date = "";
}
print "issue_date = \"$issue_date\",\n";
} elsif ($line_count == 3) {
m{<TD>(.*)</TD>};
my $section = $1;
if ($section eq " ") {
$section = "";
}
print "section = \"$section\",\n";
} elsif ($line_count == 4) {
m{<TD>(.*)</TD>};
my $byline = $1;
if ($byline eq " ") {
$byline = "";
}
print "byline = \"$byline\",\n";
} elsif ($line_count == 5) {
m{<TD>(.*)</TD>};
my $staff_writer = $1;
if ($staff_writer eq " ") {
$staff_writer = "";
}
print "staff_writer = \"$staff_writer\",\n";
} elsif ($line_count == 6) {
s/ //g;
^^^^^^^^^

QUOTE

m{<TD>(.*)</TD>};
my $headline = $1;
if ($headline eq " ") {

$headline will *never* be equal to ' '.

QUOTE

$headline = "";
}
print "headline = \"$headline\",\n";
} elsif ($line_count == 7) {
s/ //g;
^^^^^^^^^

QUOTE

m{<TD>(.*)</TD>};
my $subhead = $1;
if ($subhead eq " ") {

$subhead will *never* be equal to ' '.

QUOTE

$subhead = "";
}
print "subhead = \"$subhead\",\n";
} elsif ($line_count == 8) {
m{<TD>(.*)</TD>};
my $body_copy = $1;
if ($body_copy eq " ") {
$body_copy = "";
}
print "body_copy = \"$body_copy\",\n";
} elsif ($line_count == 9) {
m{<TD>(.*)</TD>};
my $caption_1 = $1;
if ($caption_1 eq " ") {
$caption_1 = "";
}
print "caption_1 = \"$caption_1\",\n";
} elsif ($line_count == 10) {
m{<TD>(.*)</TD>};
my $caption_2 = $1;
if ($caption_2 eq " ") {
$caption_2 = "";
}
print "caption_2 = \"$caption_2\",\n";
} elsif ($line_count == 11) {
m{<TD>(.*)</TD>};
my $caption_3 = $1;
if ($caption_3 eq " ") {
$caption_3 = "";
}
print "caption_3 = \"$caption_3\",\n";
} elsif ($line_count == 12) {
m{<TD>(.*)</TD>};
my $caption_4 = $1;
if ($caption_4 eq " ") {
$caption_4 = "";
}
print "caption_4 = \"$caption_4\",\n";
} elsif ($line_count == 13) {
m{<TD>(.*)</TD>};
my $caption_5 = $1;
if ($caption_5 eq " ") {
$caption_5 = "";
}
print "caption_5 = \"$caption_5\",\n";
} elsif ($line_count == 14) {
m{<TD>(.*)</TD>};
my $caption_6 = $1;
if ($caption_6 eq " ") {
$caption_6 = "";
}
print "caption_6 = \"$caption_6\";\n\n";
}
if ($line_count == 15) {
$line_count = 0;
}
}

close FILE_IN;
close FILE_OUT;

It looks like you have a lot of duplicated code that could be condensed:

# UNTESTED

# Map each line position within a record to its column name
my %field = (
    2  => 'issue_date',
    3  => 'section',
    4  => 'byline',
    5  => 'staff_writer',
    6  => 'headline',
    7  => 'subhead',
    8  => 'body_copy',
    9  => 'caption_1',
    10 => 'caption_2',
    11 => 'caption_3',
    12 => 'caption_4',
    13 => 'caption_5',
    14 => 'caption_6',
);

while ( <FILE_IN> ) {
    chomp;
    my $line_count = $. % 15;
    next unless $line_count; # skip every 15th line

    my $capture;
    if ( $line_count == 1 ) {
        print "INSERT INTO story\nSET story_id = NULL,\n";
        next;
    }
    elsif ( $line_count == 2 ) {
        # list slices are 0-based: pick year, month, day from (month, day, year)
        $capture = join '-', ( m{(\d\d?)/(\d\d?)/(\d\d\d\d)} )[ 2, 0, 1 ];
    }
    else {
        s/ //g;
        ( $capture ) = m{<TD>(.*)</TD>};
    }

    print qq($field{$line_count} = "$capture",\n);
    print "\n" if $line_count == 14;
}

my $total_line_count = $.;

John
--
use Perl;
program
fulfillment

James Marks

Mar 7 2005, 04:49 AM

QUOTE

Dang. Replied only to John. Reposting to the list. (I'll learn...)

I'm running it on a 1 GHz Mac OS X (FreeBSD *NIX based) with nearly 800
megs of RAM. Shouldn't that be able to handle it? (Perl 5.8.x, I
believe).

Sorta surprised me that it seems to choke on a 21.5 meg file, hence my
original question, "Why?"

Thanks,

James

John W. Krahn

Mar 7 2005, 08:56 AM

James Marks wrote:

QUOTE

The problem is most likely in your algorithm. Show us the code.

(Oops. Replied only to Charles by accident. Reposting to the list:)
Sorry. I was posting a part of the real code only to avoid posting an
overly long string of code. The problem, it seemed to me, was more
likely some other limitation than Perl since the code ran fine on
files up to a certain size. However, I'll post the whole code if
you'd like:

There is nothing in your code that I can see that would not work with
a large file (if you consider 21.5 MB large.) :-) Perhaps it is a
limitation of the operating system?

Dang. Replied only to John. Reposting to the list. (I'll learn...)

I'm running it on a 1 GHz Mac OS X (FreeBSD *NIX based) with nearly 800
megs of RAM. Shouldn't that be able to handle it? (Perl 5.8.x, I believe).

Sorta surprised me that it seems to choke on a 21.5 meg file, hence my
original question, "Why?"

Perl doesn't (normally) have any built-in size restrictions. From the command
line run perl with the -V switch and look for the string
'uselargefiles=define'. If instead it says 'uselargefiles=undef' then that
could be your problem.
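
The same setting can also be read from inside a script through the core Config module. This is only a sketch (largefile_status is a made-up helper name); note that large-file support mainly matters for files past the 2 GB mark, so it may not explain trouble with a 21.5 MB file:

```perl
use strict;
use warnings;
use Config;    # core module exposing the perl -V build settings as %Config

# Report whether this perl was built with large-file support.
# largefile_status is a hypothetical helper for illustration.
sub largefile_status {
    my $value   = $Config{uselargefiles};
    my $enabled = defined($value) && $value eq 'define';
    return $enabled ? 'define' : 'undef';
}

print "uselargefiles=", largefile_status(), "\n";
```

From the shell, `perl -V:uselargefiles` prints just that one setting.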

John
--
use Perl;
program
fulfillment

Jenda Krynicky

Mar 7 2005, 02:01 PM

From: James marks <[Email Removed]>

QUOTE

The problem is most likely in your algorithm. Show us the code.

Sorry. I was posting a part of the real code only to avoid posting an
overly long string of code. The problem, it seemed to me, was more
likely some other limitation than Perl since the code ran fine on
files up to a certain size. However, I'll post the whole code if you'd
like:

#!/usr/bin/perl
...

I do not see anything wrong with your code but 21.5MB is definitely
not too big.

Try to add some more logging into the script (like printf( "%02d %d",
$line_count, $total_line_count) to a separate log file) to see
whether it really stops reading or just stops processing the lines.
You may also want to add an else{} branch to your if(){}elsif(){}...
and die printing $line_count if the script ever enters that branch.
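
A rough sketch of both suggestions, assuming the 15-line record layout from the posted script. The helper name trace_lines is made up for illustration, and the per-field handling is left out; the log target can be a file name or, as in the usage below, a reference to a scalar:

```perl
use strict;
use warnings;

# Read $text line by line, write both counters to a separate log for
# every line, and die if the 15-line cycle counter leaves its expected
# range. trace_lines is a hypothetical helper, not the original script.
sub trace_lines {
    my ($text, $log_target) = @_;
    open my $in,  '<', \$text      or die "Cannot open input: $!";
    open my $log, '>', $log_target or die "Cannot open log: $!";
    my ($line_count, $total_line_count) = (0, 0);
    while (my $line = <$in>) {
        $line_count++;
        $total_line_count++;
        printf {$log} "%02d %d\n", $line_count, $total_line_count;
        if ($line_count >= 1 && $line_count <= 14) {
            # the real per-field handling would go here
        }
        elsif ($line_count == 15) {
            $line_count = 0;    # end of record
        }
        else {
            die "Unexpected line_count $line_count at input line $total_line_count";
        }
    }
    close $log or die "Cannot close log: $!";
    close $in;
    return $total_line_count;
}

my $log_text = '';
my $lines = trace_lines(join('', map { "row $_\n" } 1 .. 31), \$log_text);
print "read $lines lines\n";
```

If the last entry in the log is well short of the file's real line count, the script stopped reading; if the log runs to the end, the lines were read but not handled as expected.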

Jenda

===== [Email Removed] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery
