Alexander Förster

APS runs a database backup as a distributed operation. The control node builds a distributed query plan so that the backup runs in parallel. Each node involved in the backup writes one backup file to the backup server, and the data is transferred over the InfiniBand network. The following figure is from the APS help file:

[Figure 1]

Two points are worth noting here:

  • The master is not backed up in parallel.
  • If there are several databases to be backed up, the backups are performed sequentially.

 

During the backup, the status can be seen in the admin console under "Backups/Restores":

[Figure 2]

This view shows the progress of the running backup, but only as an overall status, not per node.

We have had performance problems with our backups on the APS/PDW: they ran longer than the time available, and because a backup needs an exclusive lock, we could not run any loads in the meantime.

Our first suspicion was that the network was the bottleneck. However, we tested different infrastructure setups, and even over an InfiniBand network the backups took just as long.
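
To get a feeling for whether the network can be the limiting factor at all, a rough lower bound for the pure transfer time helps. The following is a minimal sketch, not a measurement: the backup size of 900 GB and the 40 Gbit/s link speed are assumed values for illustration only, and the function name `min_transfer_hours` is hypothetical. It ignores protocol overhead, compression and disk throughput.

```python
def min_transfer_hours(backup_size_gb: float, link_gbit_per_s: float) -> float:
    """Lower bound on the transfer time in hours for a given link speed."""
    if link_gbit_per_s <= 0:
        raise ValueError("link speed must be positive")
    size_gbit = backup_size_gb * 8  # gigabytes -> gigabits
    return size_gbit / link_gbit_per_s / 3600  # seconds -> hours


# Assumed values for illustration, not taken from our appliance.
BACKUP_SIZE_GB = 900.0
LINKS = [("1 Gbit/s Ethernet", 1.0), ("40 Gbit/s link (assumed)", 40.0)]

for name, speed in LINKS:
    hours = min_transfer_hours(BACKUP_SIZE_GB, speed)
    print(f"{name}: at least {hours:.2f} h")
```

If the network were the bottleneck, the faster link should shorten the backup roughly in this proportion. That is not what we observed.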

During the backup, the first 30-50% is reached very quickly, while the remaining percentage progresses very slowly.

So we investigated further, and the following query gave us a big surprise:

 

One of the nodes finished after only 1 hour, while the other nodes (in this case just 3 more) took five times as long:

[Figure 3]
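
The effect of such skew on the total runtime can be sketched with a few lines of Python. The per-node durations below are illustrative values that only mirror the ratio described above (one node at 1 hour, three nodes at five times that). The node names and the function `summarize_backup` are hypothetical.

```python
def summarize_backup(node_hours: dict[str, float]) -> dict[str, float]:
    """Summarize a parallel backup from per-node elapsed times (in hours)."""
    if not node_hours:
        raise ValueError("at least one node is required")
    slowest = max(node_hours.values())
    fastest = min(node_hours.values())
    return {
        # The backup as a whole ends when the slowest node ends.
        "overall_hours": slowest,
        # How much longer the slowest node ran than the fastest one.
        "skew_ratio": slowest / fastest,
        # Time that faster nodes spent waiting for the slowest node.
        "idle_node_hours": sum(slowest - h for h in node_hours.values()),
    }


# Illustrative durations only, not the exact values from our system.
durations = {"node1": 1.0, "node2": 5.0, "node3": 5.0, "node4": 5.0}
summary = summarize_backup(durations)
print(summary)
```

The overall duration is the maximum of the per-node durations, so a single slow node is enough to stretch the whole backup window.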

A different data distribution could explain this, but a quick check of the database's storage properties shows that the database files on each node are nearly the same size:

[Figure 4]

We reproduced this issue on two different half-rack APS appliances with AU2 and also on a half-rack APS with AU3. Differences between the nodes appear there as well. With AU3 the differences were smaller and the backup itself performed better, but the issue remains.

 

Conclusion:

  • The backup is finished only when the last node has completed its part.
  • The weak 1 Gbit/s network is not the weakest link in the backup chain; something in the APS must be slowing down the backup on "some" nodes.

A ticket has been opened with Microsoft to resolve this issue.