APS runs a database backup as a distributed action. The control node builds a distributed query plan to perform the backup in parallel, and each node involved writes one backup file to the backup server. The parallel backup uses the InfiniBand network. Figure from the APS help file:
According to this, there are two points to note:
- The master is not backed up in parallel.
- If there are several databases to be backed up, the backups are performed sequentially.
The progress of a running backup can be seen here, but only as an overall status.
We have had performance issues with our backups on the APS/PDW: they overran their time window, and we could not run any loads in the meantime because the backup needs an exclusive lock.
Our first suspicion was that the network was the bottleneck, but we tested with different infrastructure settings, and even over an InfiniBand network the backups took the same long time.
During the backup, the first 30-50% is reached very quickly, while the remaining percent progresses very slowly.
So we investigated further, and the following query gave us a big surprise:
select run.run_id, database_name,
-- Elapsed time of the whole run (milliseconds) formatted as hh:mm:ss
RIGHT('0' + CAST(run.total_elapsed_time / 1000 / 3600 AS VARCHAR), 2)
+ ':' + RIGHT('0' + CAST((run.total_elapsed_time / 1000 / 60) % 60 AS VARCHAR), 2)
+ ':' + RIGHT('0' + CAST(run.total_elapsed_time / 1000 % 60 AS VARCHAR), 2) AS run_elapsed
, run.status, det.pdw_node_id AS Node,
-- Elapsed time on the individual node, formatted as hh:mm:ss
RIGHT('0' + CAST(det.total_elapsed_time / 1000 / 3600 AS VARCHAR), 2)
+ ':' + RIGHT('0' + CAST((det.total_elapsed_time / 1000 / 60) % 60 AS VARCHAR), 2)
+ ':' + RIGHT('0' + CAST(det.total_elapsed_time / 1000 % 60 AS VARCHAR), 2) AS node_elapsed
from sys.pdw_loader_backup_runs run
left join sys.pdw_loader_backup_run_details det on run.run_id = det.run_id
where run.operation_type = 'BACKUP'
and run.mode = 'FULL'
order by run.submit_time desc
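The query turns total_elapsed_time, which is reported in milliseconds, into hh:mm:ss using integer division and modulo. As a rough way to check that arithmetic outside the appliance, here is a minimal Python sketch. The function name format_elapsed is made up for this illustration and is not part of APS.

```python
def format_elapsed(ms: int) -> str:
    """Format a millisecond count as hh:mm:ss using integer arithmetic.

    Hypothetical helper that mirrors the division/modulo steps of the query.
    Unlike RIGHT(..., 2) in the query, it does not cut off hours above 99.
    """
    total_seconds = ms // 1000           # milliseconds -> whole seconds
    hours = total_seconds // 3600        # whole hours
    minutes = (total_seconds // 60) % 60 # minutes within the hour
    seconds = total_seconds % 60         # seconds within the minute
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    # One hour and five hours, the two durations discussed in this post
    print(format_elapsed(3_600_000))
    print(format_elapsed(18_000_000))
```

One difference is worth knowing: the query keeps only the last two characters of the hour part, so a run of 100 hours or more would be shown incorrectly there.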
One of the nodes finished after only 1 hour, while the other nodes (in this case just 3 more) took five times longer:
An uneven data distribution could explain this, but a quick check of the database's storage properties showed that the database files on each node were nearly the same size:
We tested this on two different half-rack APS appliances with AU2 and also on a half-rack APS with AU3. The nodes differed there as well. With AU3 the differences were smaller and the backup itself performed better, but the issue remains.
- The backup is complete only when the last node has finished its backup.
- The weak network (1 Gbit/s) is not the weakest link in the backup chain; something in the APS must be slowing down the backup on some nodes.
A ticket has been opened with Microsoft to resolve this issue.
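The first point can be made concrete with a small Python sketch: the run takes as long as its slowest node, and the ratio between the slowest and fastest node shows how uneven the run was. The NodeBackup type, the summarize function and the node ids are assumptions made for this illustration. The durations are the rounded figures from this post (one node at about 1 hour, three nodes at about 5 hours), not exact measurements.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeBackup:
    node_id: int     # hypothetical node id, stands in for pdw_node_id
    elapsed_ms: int  # elapsed time on that node in milliseconds


def summarize(nodes: List[NodeBackup]) -> Dict[str, float]:
    """Return the effective backup duration and the slowest/fastest ratio."""
    slowest = max(nodes, key=lambda n: n.elapsed_ms)
    fastest = min(nodes, key=lambda n: n.elapsed_ms)
    return {
        # The backup is done only when the last node is done
        "backup_elapsed_ms": slowest.elapsed_ms,
        "slowest_node": slowest.node_id,
        "skew_ratio": slowest.elapsed_ms / fastest.elapsed_ms,
    }


# Rounded, illustrative figures: one node at 1 hour, three nodes at 5 hours
nodes = [
    NodeBackup(1, 3_600_000),
    NodeBackup(2, 18_000_000),
    NodeBackup(3, 18_000_000),
    NodeBackup(4, 18_000_000),
]
summary = summarize(nodes)

if __name__ == "__main__":
    print(summary)
```

With evenly sized database files per node, one would expect a skew ratio close to 1, which is why a ratio of about 5 points to something other than data distribution.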