Hello everyone,
I want someone to share his expertise on the following:
Suppose we have a table with some data that periodically must be synced
with an external source, which provides only partial information (e.g.
it might provide all the fields for a new record, but might provide
only a single field difference for an already existing record). This
obviously will involve a series of UPDATE and INSERT statements, mostly
in random order. As most RDBMS can hold only one prepared statement at a
time (including MySQL) I have 2 ways to do this that look somewhat
efficient:
1. Sweep the pre-existing table into a hash, DELETE all the records from
it and INSERT the new data, using the hash as a source for missing
fields.
2. Create 3 connections to the database, have the first one hold a
SELECT statement that will replace the hash above, have the second
connection be an INSERT for nonexisting records and the third connection
be an UPDATE for existing ones
From reading on DBI I understood that moving lots of data with do()
statements is pretty slow (e.g. constantly re-preparing the same
statement), so I think scenario 2 with a single connection is even
worse.
The reason I am elaborating on this is that I am afraid the dataset will
stop fitting in memory at some point of time if I go with scenario 1
(which is easier and cleaner to implement). Also I have no idea how
resource intensive scenario 2 would be, although I have proper
indexing and stuff so the UPDATES should be fast enough...
Thanks
Peter