WorkTheWeb Forums > DBI insert vs update question

Peter Rabbitson

Jul 10 2005, 02:09 PM

Hello everyone,
I want someone to share his expertise on the following:
Suppose we have a table with some data that periodically must be synced
with an external source, which provides only partial information (e.g.
it might provide all the fields for a new record, but might provide
only a single field difference for an already existing record). This
obviously will involve a series of UPDATE and INSERT statements, mostly
in random order. As most RDBMS can hold only one prepared statement at a
time (including MySQL) I have 2 ways to do this that look somewhat
efficient:

1. Sweep the pre-existing table into a hash, DELETE all the records from
it and INSERT the new data, using the hash as a source for missing
fields.

2. Create 3 connections to the database, have the first one hold a
SELECT statement that will replace the hash above, have the second
connection be an INSERT for nonexisting records and the third connection
be an UPDATE for existing ones

From reading on DBI I understood that moving lots of data with do()
statements is pretty slow (e.g. constantly re-preparing the same
statement), so I think scenario 2 with a single connection is even
worse.

The reason I am elaborating on this is that I am afraid the dataset will
stop fitting in memory at some point of time if I go with scenario 1
(which is easier and cleaner to implement). Also I have no idea how
resource intensive scenario 2 would be, although I have proper
indexing and stuff so the UPDATES should be fast enough...

Thanks

Peter

Octavian Rasnita

Jul 10 2005, 02:23 PM

Can't you use the replace sql query?

Use it like you use "insert". It will insert new rows where there are no
rows, and do an update where there are rows...

Teddy

----- Original Message -----
From: "Peter Rabbitson" <[Email Removed]>
To: <[Email Removed]>
Sent: Sunday, July 10, 2005 18:09 PM
Subject: DBI insert vs update question

QUOTE

Hello everyone,
I want someone to share his expertise on the following:
Suppose we have a table with some data that periodically must be synced
with an external source, which provides only partial information (e.g.
it might provide all the fields for a new record, but might provide
only a single field difference for an already existing record). This
obviously will involve a series of UPDATE and INSERT statements, mostly
in random order. As most RDBMS can hold only one prepared statement at a
time (including MySQL) I have 2 ways to do this that look somewhat
efficient:

1. Sweep the pre-existing table into a hash, DELETE all the records from
it and INSERT the new data, using the hash as a source for missing
fields.

2. Create 3 connections to the database, have the first one hold a
SELECT statement that will replace the hash above, have the second
connection be an INSERT for nonexisting records and the third connection
be an UPDATE for existing ones

From reading on DBI I understood that moving lots of data with do()
statements is pretty slow (e.g. constantly re-preparing the same
statement), so I think scenario 2 with a single connection is even
worse.

The reason I am elaborating on this is that I am afraid the dataset will
stop fitting in memory at some point of time if I go with scenario 1
(which is easier and cleaner to implement). Also I have no idea how
resource intensive scenario 2 would be, although I have proper
indexing and stuff so the UPDATES should be fast enough...

Thanks

Peter

--
To unsubscribe, e-mail: [Email Removed]
For additional commands, e-mail: [Email Removed]
<http://learn.perl.org/> <http://learn.perl.org/first-response

Peter Rabbitson

Jul 10 2005, 02:33 PM

On Sun, Jul 10, 2005 at 06:23:19PM +0300, Octavian Rasnita wrote:

QUOTE

Can't you use the replace sql query?

Use it like you use "insert". It will insert new rows where there are no
rows, and do an update where there are rows...

Negative. REPLACE is just a shortcut for DELETE FROM... INESERT INTO (at
least in MySQL) with some pretty neat gimmicks to find out what exactly
to DELETE. Performance is the same but you have less control.

Peter

Peter Rabbitson

Jul 10 2005, 02:35 PM

QUOTE

I'm sure others will give you more informed answers.. But why can't
you create multiple statement handlers under the same connection?

Because you can't. One connection holds only one prepared statement (at
least in MySQL). If you prepare $statement2 on the same $dbh,
$statement1 automatically gets invalidated. Clinically proven :)

Peter

Mads N. Vestergaard

Jul 10 2005, 03:13 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey Peter,

Are you sure about that?
As far as I understand, what you say you can't do, is to have two
prepares for example:

my $dbh = DBI->connect("dbi:mysql:database","user","pass");
my $foo = $dbh->prepare("SELECT * FROM .....");
my $bar = $dbh->prepare("INSERT INTO .....");
$foo->execute();
$foo->finish();
$bar->execute();
$bar->finish();

At least that works for me.

Sorry if I misunderstood you.

Mads

Peter Rabbitson wrote:
|>I'm sure others will give you more informed answers.. But why can't
|>you create multiple statement handlers under the same connection?
|>
|
|
| Because you can't. One connection holds only one prepared statement (at
| least in MySQL). If you prepare $statement2 on the same $dbh,
| $statement1 automatically gets invalidated. Clinically proven :)
|
| Peter
|
|

- --
Mads N. Vestergaard - http://rwxr-xr-x.dk
Interested in Open Source, and web application development
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFC0Ukt7WOaPMd53OMRAvcCAJ0Y/kCn1Oh67D3AUrCfpq7ihBOu8QCeO/SC
JMvYjVK+8naZfgReXn13AQU=
=psBJ
-----END PGP SIGNATURE-----

David Van Ginneken

Jul 10 2005, 04:31 PM

Below is taken from: http://search.cpan.org/dist/DBD-mysql/lib/DBD/mysql.pm

Also look at: http://dev.mysql.com/tech-resources/articl...statements.html

Prepared statement support (server side prepare)

To use server side prepared statements, all you need to do is set
the variable mysql_server_prepare in the connect:

$dbh = DBI->connect(
"DBI:mysql:database=test;host=localhost:mysql_server_prepare=1", "",
"", { RaiseError => 1, AutoCommit => 1 } );

To make sure that the 'make test' step tests whether server
prepare works, you just need to export the env variable
MYSQL_SERVER_PREPARE:

export MYSQL_SERVER_PREPARE=1

Test first without server side prepare, then with.

On 7/10/05, Mads N. Vestergaard <[Email Removed]> wrote:

QUOTE

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey Peter,

Are you sure about that?
As far as I understand, what you say you can't do, is to have two
prepares for example:

my $dbh = DBI->connect("dbi:mysql:database","user","pass");
my $foo = $dbh->prepare("SELECT * FROM .....");
my $bar = $dbh->prepare("INSERT INTO .....");
$foo->execute();
$foo->finish();
$bar->execute();
$bar->finish();

At least that works for me.

Sorry if I misunderstood you.

Mads

Peter Rabbitson wrote:
|>I'm sure others will give you more informed answers.. But why can't
|>you create multiple statement handlers under the same connection?
|
|
|
| Because you can't. One connection holds only one prepared statement (at
| least in MySQL). If you prepare $statement2 on the same $dbh,
| $statement1 automatically gets invalidated. Clinically proven :)
|
| Peter
|
|

- --
Mads N. Vestergaard - http://rwxr-xr-x.dk
Interested in Open Source, and web application development
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)

iD8DBQFC0Ukt7WOaPMd53OMRAvcCAJ0Y/kCn1Oh67D3AUrCfpq7ihBOu8QCeO/SC
JMvYjVK+8naZfgReXn13AQU> =psBJ
-----END PGP SIGNATURE-----

--
To unsubscribe, e-mail: [Email Removed]
For additional commands, e-mail: [Email Removed]
<http://learn.perl.org/> <http://learn.perl.org/first-response

Wiggins d'Anconia

Jul 10 2005, 04:39 PM

Peter Rabbitson wrote:

QUOTE

Hello everyone,
I want someone to share his expertise on the following:
Suppose we have a table with some data that periodically must be synced
with an external source, which provides only partial information (e.g.
it might provide all the fields for a new record, but might provide
only a single field difference for an already existing record). This
obviously will involve a series of UPDATE and INSERT statements, mostly
in random order. As most RDBMS can hold only one prepared statement at a
time (including MySQL) I have 2 ways to do this that look somewhat
efficient:

What version of MySQL? Generally this would be handled with a
transaction, but only newer MySQLs support them. And I suspect Randal
(if he is listening) will pop in with a comment about using PostgreSQL
right about now :-). And now, having used both I would agree with him.

QUOTE

1. Sweep the pre-existing table into a hash, DELETE all the records from
it and INSERT the new data, using the hash as a source for missing
fields.

Yeh this sounds like a nightmare.

QUOTE

2. Create 3 connections to the database, have the first one hold a
SELECT statement that will replace the hash above, have the second
connection be an INSERT for nonexisting records and the third connection
be an UPDATE for existing ones

As the other posters said, and to my knowledge, you should double check
that you can't prepare multiple statements. From the DBI docs:

http://search.cpan.org/~timb/DBI-1.48/DBI.pm#prepare

"Portable applications should not assume that a new statement can be
prepared and/or executed while still fetching results from a previous
statement."

That is a "should not assume" rather than a can't do it, and that is
specifically geared towards "portable applications" which generally
means when dealing with multiple different db backends. You might also
want to have a read through the 'prepare_cached' section following the
above. It appears to be a way in DBI to handle the same if the backend
doesn't support it.

QUOTE

From reading on DBI I understood that moving lots of data with do()
statements is pretty slow (e.g. constantly re-preparing the same
statement), so I think scenario 2 with a single connection is even
worse.

Slow is still only meaningful in your context until you benchmark
whether it really is too slow you won't know.

QUOTE

The reason I am elaborating on this is that I am afraid the dataset will
stop fitting in memory at some point of time if I go with scenario 1
(which is easier and cleaner to implement). Also I have no idea how
resource intensive scenario 2 would be, although I have proper
indexing and stuff so the UPDATES should be fast enough...

Easier I would agree with, cleaner I definitely wouldn't, at least not
if we are talking about enterprise level stuff. There are all kinds of
issues you can run into when trying to do a DELETE/INSERT instead of the
more appropriate UPDATE, especially when there is an intermediary (your
Perl hash) involved.

QUOTE

Thanks

Peter

HTH some,

http://danconia.org