In the “Update on Infrastructure” post, we talked about how we needed to get disks that would adequately handle our performance needs and were designed for our hardware (i.e. “supported by the manufacturer”). Those arrived on Tuesday and we brought them to our data center straight away and built the new RAID array.
Yesterday, relative to the writing of this post, we put the Open Hub into Read Only mode. Our first plan was to use our replicated database to run a pg_basebackup from slave to master; however, the replicated database turned out to be a few hours out of sync, when it shouldn’t have been more than a few tens of milliseconds behind. So we decided to do a pg_dumpall of the primary database, change the mounts so the database server pointed to the new RAID array, and restore the cluster.
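For readers who run their own replicas, here is a minimal sketch of the kind of lag check that would have flagged this earlier. It is an illustration, not our actual monitoring: the thresholds and the `classify_lag` helper are assumptions made up for this example. The query uses PostgreSQL's built-in `pg_last_xact_replay_timestamp()` and is meant to be run on the standby.

```python
from datetime import timedelta

# Run this on the standby. It returns NULL if nothing has been replayed yet.
# Caveat: on a primary with no write traffic, this value grows even when
# replication is healthy, so treat it as a rough signal only.
LAG_QUERY = "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"


def classify_lag(lag, warn=timedelta(seconds=1), crit=timedelta(minutes=5)):
    """Map a replay lag (timedelta or None) to a status string.

    The warn and crit thresholds are hypothetical defaults for illustration.
    """
    if lag is None:
        return "unknown"
    if lag >= crit:
        return "critical"
    if lag >= warn:
        return "warning"
    return "ok"
```

A lag of a few tens of milliseconds falls under "ok" with these thresholds, while a replica that is hours behind is reported as "critical".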
We were optimistic that this would take 4 to 6 hours, 10 at the far outside.
Ladies and Gentlemen, Girls and Boys: this restore has been running for 18 and three-quarter hours, and it is very difficult to determine precisely how long it will take. The restore process, being a straight file load into psql, has no progress indicators.
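One way to get a rough progress signal for this kind of restore is to feed the dump file to psql yourself and count the bytes as they go by. The sketch below is not what we ran; `copy_with_progress` and the file name in the usage comment are hypothetical. Bytes fed is also only an approximation of time remaining, since index builds near the end of a dump can take far longer than their share of the file suggests.

```python
def copy_with_progress(src, dst, total_bytes, report, chunk_size=1 << 20):
    """Copy src to dst in chunks, calling report(percent) after each chunk.

    src and dst are binary file-like objects; total_bytes is the size of
    the dump file. Returns the number of bytes copied.
    """
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        report(100.0 * copied / total_bytes if total_bytes else 100.0)
    return copied


# Hypothetical usage (file name and database name are assumptions):
#
#   import os, subprocess
#   path = "dumpall.sql"
#   proc = subprocess.Popen(["psql", "--dbname", "postgres"],
#                           stdin=subprocess.PIPE)
#   with open(path, "rb") as dump:
#       copy_with_progress(dump, proc.stdin, os.path.getsize(path),
#                          lambda pct: print(f"{pct:.1f}% of dump fed to psql"))
#   proc.stdin.close()
#   proc.wait()
```

Because psql reads from the pipe only as fast as it can execute statements, the percentage advances with the restore rather than racing ahead of it, apart from whatever the pipe buffer holds.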
Again, we apologize for the inconvenience this is causing. On the plus side, we will be examining all aspects of our replication implementation, have added appropriate new monitoring and reporting, and are planning for architectural changes that will let us continue to serve the API even if the rest of the website has to be put into RO mode. Oh, and when we’re done, we’ll have brand new indexes on our database. That will be nice.