I'm in the process of implementing a backup service with these major requirements:
- Is external ( so it survives hardware failures of the backed-up servers )
- Provides a time machine ( so older versions of files can also be restored )
and the additional wishlist:
- Backups are fast
- The backup process is lightweight ( the servers are used in production and are loaded around the clock )
- Service is reliable
- Implementing it is as simple as possible
- The interface is universal (e.g. it's better to use a filesystem than custom solution over dump/restore)
Of course the Perl motto "there is more than one way to do it" is valid for the major goals.
E.g. The external part could be done via:
- some sort of network file system
- synchronization via a network protocol to a file system living on external host
and the time machine could be done via
- incremental backups (e.g. dump/restore)
- a version control system, with Git being first on my list
My current idea is to use:
- Software block device over the net ( External )
- NILFS2 ( Time machine )
So I'm on the hunt for a solution that is:
- Simple ( avoid over-complication in implementation, configuration, features, dependencies .. )
- Supported by Linux
- Both server and client
- Strongly preferred to be merged in mainline kernel
- Strongly preferred tooling to be packaged in Debian
All the protocols listed below should be interchangeable. I might do some benchmarks at a later stage.
iSCSI ( SCSI over the Internet )
Network Block Device
Exporting a device via NBD is a matter of:
root@server:/# apt-get install nbd-server
root@server:/# cat /etc/nbd-server/config
[generic]
[export0]
exportname = /dev/mapper/vg0-nbd6.0
port = 99
root@server:/# /etc/init.d/nbd-server restart
And importing it on a client is:
root@client:/# apt-get install nbd-client
root@client:/# grep -v '^#' /etc/nbd-client
AUTO_GEN="n"
KILLALL="true"
NBD_DEVICE=/dev/nbd0
NBD_TYPE=r
NBD_HOST=SERVER-HOSTNAME
NBD_PORT=99
root@client:/# /etc/init.d/nbd-client restart
You might want to check the manual pages in the respective packages for more configuration options and tweaks. E.g. the nbd-client init script can auto-mount file systems.
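Once the import succeeds, a quick sanity check on the client might look like this (a sketch; device name as configured above):

```shell
# Verify the imported device is present and readable.
blockdev --getsize64 /dev/nbd0        # size in bytes, as exported by the server
dd if=/dev/nbd0 of=/dev/null bs=1M count=10 iflag=direct 2>&1 | grep -i copied
```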
By default, nbd-client creates a blockdevice with a block size of 1024 bytes:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 iflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s
1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 14.1721 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.6536 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.1352 s, 69.3 MB/s
1048576000 bytes (1.0 GB) copied, 15.5831 s, 67.3 MB/s
1048576000 bytes (1.0 GB) copied, 14.3358 s, 73.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.256 s, 68.7 MB/s
1048576000 bytes (1.0 GB) copied, 13.9433 s, 75.2 MB/s
1048576000 bytes (1.0 GB) copied, 13.0245 s, 80.5 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             365.70     32194.80       380.00     321948       3800
sdb             316.20     31760.40       319.20     317604       3192
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             361.80     39333.20       281.20     393332       2812
sdb             323.20     39295.20       260.80     392952       2608
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.20     35762.80       238.40     357628       2384
sdb             274.90     35794.40       201.20     357944       2012
To summarize, we get about 70-80 MB/s, with the server reading about 100 KB in each request. The results are pretty much the same with a block size of 2048 bytes.
A 4k block size drops the transfer rate to 55 MB/s while keeping the 100 KB per I/O op rate.
Let's remove the "direct" flag from dd:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 14.5043 s, 72.3 MB/s
1048576000 bytes (1.0 GB) copied, 18.6863 s, 56.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.6981 s, 66.8 MB/s
1048576000 bytes (1.0 GB) copied, 15.8664 s, 66.1 MB/s
1048576000 bytes (1.0 GB) copied, 16.7602 s, 62.6 MB/s
1048576000 bytes (1.0 GB) copied, 18.382 s, 57.0 MB/s
1048576000 bytes (1.0 GB) copied, 17.1475 s, 61.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.3853 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 19.3907 s, 54.1 MB/s
1048576000 bytes (1.0 GB) copied, 21.7969 s, 48.1 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             312.60     30968.40       173.60     309684       1736
sdb             284.80     30978.00       172.00     309780       1720
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             330.40     32506.40       166.00     325064       1660
sdb             280.60     32517.20       152.00     325172       1520
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             224.40     33598.80        51.60     335988        516
sdb             208.20     33604.40        60.80     336044        608
So this time we get around 60 MB/s at the same 100 KB per I/O operation ratio (note that the server is not totally idle and this is not the only disk activity it sees). With a block size of 2048 bytes this test shows a decreased speed of about 50 MB/s, and the number of I/O ops per second doubles. A 4k block size gives us an average of 60 MB/s with 50 KB per I/O op.
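The "KB per I/O operation" figures quoted here come straight from the iostat columns: the average request size is kB_read/s divided by tps. A quick sketch, using the first server sample from the direct-read test above:

```shell
# Average KB per read request = kB_read/s / tps.
# The sample line is copied from the first iostat output above.
echo "sda 365.70 32194.80 380.00" |
awk '{ printf "%s: %.0f KB per read op\n", $1, $3 / $2 }'
# sda: 88 KB per read op -- i.e. the "about 100 KB" quoted above
```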
Let's do some write tests:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 oflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 10.1818 s, 103 MB/s
1048576000 bytes (1.0 GB) copied, 9.89168 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.73052 s, 108 MB/s
1048576000 bytes (1.0 GB) copied, 9.89912 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.91606 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0242 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.95247 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.92473 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0946 s, 104 MB/s
1048576000 bytes (1.0 GB) copied, 10.1183 s, 104 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             137.80         7.20     51806.80         72     518068
sdb             144.20         1.20     51798.00         12     517980
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             125.70        16.00     52375.20        160     523752
sdb             132.20         4.80     52362.80         48     523628
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             133.20         4.80     52117.60         48     521176
sdb             130.40         5.20     52265.20         52     522652
Write speed is 105 MB/s with about 500 KB per I/O.
With block sizes of 2k and 4k the results of this test stay the same.
And let's remove the "direct" flag while writing:
# On the client
blockdev --getbsz /dev/nbd0
1024
for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 9.34019 s, 112 MB/s
1048576000 bytes (1.0 GB) copied, 15.3738 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.6453 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 20.3934 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 20.1742 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.0891 s, 54.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.4181 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 16.8115 s, 62.4 MB/s
1048576000 bytes (1.0 GB) copied, 18.3555 s, 57.1 MB/s
1048576000 bytes (1.0 GB) copied, 20.0491 s, 52.3 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             242.30       667.60     28498.00       6676     284980
sdb             261.80       768.00     26874.40       7680     268744
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             236.70       639.60     28760.00       6396     287600
sdb             247.80       653.20     29739.20       6532     297392
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             257.60       760.00     20544.40       7600     205444
sdb             155.30       356.00     21658.40       3560     216584
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.80      1026.40     28021.20      10264     280212
sdb             136.60       238.80     26988.80       2388     269888
We see decreased write speed, around 50-60 MB/s, and once again about 100 KB per I/O operation. The results are pretty much the same with a block size of 2048 bytes.
Increasing the block size to 4k, though, raises the transfer speed to about 100 MB/s and gives a nice 500 KB per I/O request.
Next: Summarize the above results in a nice table and test with real files and filesystem
Each cell below is throughput / average request size:

| Block size | Sequential read  | Sequential read + direct | Sequential write  | Sequential write + direct |
|------------|------------------|--------------------------|-------------------|---------------------------|
| 1k         | 60 MB/s / 100 KB | 75 MB/s / 100 KB         | 55 MB/s / 100 KB  | 105 MB/s / 500 KB         |
| 2k         | 50 MB/s / 50 KB  | 75 MB/s / 100 KB         | 55 MB/s / 100 KB  | 105 MB/s / 500 KB         |
| 4k         | 50 MB/s / 50 KB  | 55 MB/s / 100 KB         | 100 MB/s / 500 KB | 105 MB/s / 500 KB         |
Securing who can access the device is a different story though. The server implementation does not support any authentication. Well, it does support IP-based ACLs, but that means little, since in most configurations IP addresses can easily be spoofed. I don't see much point in putting such ACLs in the server, as they can be implemented more easily and reliably in the firewall.
So if you want/need security with NBD you should:
- On the server: make sure you limit access to the TCP port the server is listening on. E.g. only allow certain interfaces (explicitly disallowing the "lo" interface might also be a good idea) and only allow certain IP and/or MAC addresses.
- On the network: make sure the IP and/or MAC addresses in the server ACL cannot be spoofed. E.g. provide a dedicated wire/VLAN/etc. and/or use managed switches to guarantee the path between the clients and servers.
- On the network: if you intend to route NBD traffic via some public network, you might want to add an additional layer of encryption/authentication. IPsec or another tunneling scheme sounds useful.
- On the client: it might be useful to limit NBD traffic to a particular UID (most probably root). This is especially important if you have untrusted apps running.
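As a firewall sketch of the points above (the interface, the client address, and the port are examples matching the configuration shown earlier; adjust to your setup):

```shell
# On the server: accept NBD connections (TCP port 99, as configured above)
# only from the known client IP on the dedicated interface; drop the rest.
iptables -A INPUT -i eth1 -s 192.0.2.10 -p tcp --dport 99 -j ACCEPT
iptables -A INPUT -p tcp --dport 99 -j DROP

# On the client: only root may talk to the NBD server. The owner match
# works on locally generated traffic, hence the OUTPUT chain.
iptables -A OUTPUT -p tcp --dport 99 -m owner ! --uid-owner root -j REJECT
```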
nbd-server in Debian testing (as of 100110) does not support SDP (Sockets Direct Protocol), so TCP/IP was used for the tests. SDP is claimed to offer better performance.
I've read somewhere that NBD does not handle connection problems particularly well.
DST stands for Distributed STorage
Merged in the (then recent) 2.6.30 kernel. Update: unfortunately, it was
removed as of the 2.6.33 kernel.
As far as I can see from various sources, it is implemented as an alternative to NBD and iSCSI.
Its author (Evgeniy Polyakov) looks like a good hacker, and when a good hacker feels he has to come up with a new implementation, there must be something wrong with the old ones.
DST looks like the second option I will try, as I also plan to implement a similar backup solution in a distributed environment over insecure channels.
- New and probably unstable implementation. As of Linux v2.6.32 it is still in the "staging" area.
- Native encryption support, so it is usable over insecure channels
- Both the client & server are implemented in the kernel
- Single vendor
ATA over Ethernet
- ggaoed - Promising new server implementation
- AoE performance comparison from the AoE specification vendor. Note that this vendor is also the only supplier of AoE hardware devices so the results might be misleading.
- Multiple vendors for the server implementation
- The client is implemented in the kernel ( "aoe" module )
- I will probably try the vblade+aoetools option first as it is already packaged in Debian.
- ggaoed has Debian build scripts
- TODO: Post some results from real world tests
AoE works directly at layer 2 (the Ethernet data link layer), bypassing the
processing overhead of the upper layers (IP, TCP/UDP).
This is a candidate for a performance boost but it also has some drawbacks.
E.g. it cannot easily be passed through routers. Even if Ethernet-in-IP tunneling is used, IP fragmentation will likely occur, which will probably slow things down. It looks suitable for use within the data center, where performance is needed and the client and server are either directly connected or interconnected via a good switch supporting jumbo frames.
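A rough sketch of why jumbo frames matter here: AoE moves whole 512-byte sectors in each Ethernet frame, and the commonly quoted per-frame AoE ATA header overhead is 36 bytes (that figure is an assumption from the AoE specification, not from my measurements):

```shell
# Sectors carried per Ethernet frame = floor((MTU - 36) / 512),
# assuming a 36-byte AoE ATA command header inside the frame payload.
for mtu in 1500 9000; do
    echo "MTU $mtu: $(( (mtu - 36) / 512 )) sectors per frame"
done
# MTU 1500: 2 sectors per frame
# MTU 9000: 17 sectors per frame
```

So a standard 1500-byte MTU spends a large share of each frame on headers (1024 data bytes out of 1500), while jumbo frames amortize the overhead over 17 sectors.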
The AoE protocol is stateless and insecure by design.
So if we want security, we should apply some additional measures.
Security of the storage
To guarantee the security of the storage we could think of some
sort of isolation of the path.
Several options come to my mind:
- Dedicated Ethernet interfaces and a dedicated wire between the client and the server
- VLAN isolation
- MAC filtering on the server and on the switch(es)
With the first one, of course, being the most secure (switches can also be penetrated).
MAC filtering can easily be misused: if you do the filtering only on the server, any other host within the network can be reconfigured to become a client.
The path isolation will guarantee that a breach in another host in the same LAN segment will not compromise the storage.
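Since AoE uses its own dedicated EtherType (0x88A2), filtering can key on that rather than on IP. A hypothetical ebtables sketch on the server (the MAC address and interface name are examples):

```shell
# Drop AoE frames (EtherType 0x88A2) that do not come from the known
# client MAC, and drop AoE arriving on anything but the dedicated interface.
ebtables -A INPUT -p 0x88A2 -s ! 00:11:22:33:44:55 -j DROP
ebtables -A INPUT -p 0x88A2 -i ! eth1 -j DROP
```

As noted above, this only raises the bar; MACs are spoofable, so path isolation remains the stronger guarantee.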
Security of the data
Data security is another topic. Although a man-in-the-middle attack does not look too probable within the data center, you might prefer to be paranoid (or you might simply have a different setup requiring it). In that case you can always add an additional layer of encryption on the client, at the cost of more CPU cycles and probably slightly increased latency.
One additional aspect bugged me:
what if a user account on the client host gets compromised? Could it be used to run an AoE client in userspace and gain access to the data?
Thankfully, no. Access to the server is done via raw sockets with a dedicated EtherType, and creating raw sockets under Linux requires the CAP_NET_RAW capability, which is usually granted only to root.
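This is easy to verify: without CAP_NET_RAW, opening an AF_PACKET raw socket (what a userspace AoE client would need) fails with EPERM. A small sketch, using python3 just to make the syscall visible:

```shell
# Creating an AF_PACKET raw socket requires the CAP_NET_RAW capability;
# run as an ordinary user this takes the EPERM branch.
python3 -c '
import socket
try:
    s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW)
    print("raw socket created (we have CAP_NET_RAW)")
except PermissionError:
    print("EPERM: CAP_NET_RAW required")
'
```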
Both machines are Dell PowerEdge R200:
- 1 Intel Xeon CPU X3320 @ 2.50GHz with 4 cores
- 4 GB of memory.
- Debian GNU/Linux testing/Squeeze
- 2.6.30-2-686-bigmem kernel package
- 2 x Broadcom NetXtreme BCM5721 ( 1Gbit, No jumbo frame support )
2 HDDs, each of them being:

Model Family:     Seagate Barracuda ES.2
Device Model:     ST3750330NS
Firmware Version: SN05
User Capacity:    750,156,374,016 bytes
The servers are connected via a dedicated wire.
The network interfaces are at:
root@client:/# ethtool eth1
Settings for eth1:
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        Auto-negotiation: on
        Link detected: yes
Neither system was completely idle during the tests.
Here goes the block device export:
root@server:/# lvcreate --verbose --size 500G --name nbd6.0 VGNAME /dev/md8 /dev/md9
root@server:/# vblade 6 0 eth1 /dev/VGNAME/nbd6.0 2>&1
md8 is a soft RAID0 (striping) over 2x150 GB partitions
at the end of the HDDs, as is md9. Two physical HDDs are
used in total. The soft RAID is added for performance; the
partitioning is done for easier relocation of parts of the data.
Partitions at the end of the drive suffer roughly a 1.5x to 2x performance penalty for sequential operations. This is due to the geometry of conventional (Winchester) hard drives: the platter spins at a constant rate, but outer tracks have a larger circumference than inner ones and are divided into more sectors, so each revolution transfers more data on the outer tracks.
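As a back-of-the-envelope sketch of that geometry argument (the radii below are typical 3.5" platter figures I'm assuming, not measured values):

```shell
# Sequential throughput scales with track circumference, i.e. with radius.
# Assuming a data zone from ~15 mm (innermost) to ~47 mm (outermost) radius:
awk 'BEGIN {
    inner = 15; outer = 47                 # assumed radii in mm
    printf "outer/inner throughput ratio ~ %.1f\n", outer / inner
}'
# outer/inner throughput ratio ~ 3.1
```

That ratio is the extreme innermost-vs-outermost bound; partitions at the end of the drive that stop short of the very innermost tracks see the smaller 1.5x-2x penalty quoted above.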
The performance I was able to get from this raid on the server looks like:
root@server:/# hdparm -tT /dev/VGNAME/nbd6.0
/dev/VGNAME/nbd6.0:
 Timing cached reads:   4146 MB in  2.00 seconds = 2073.21 MB/sec
 Timing buffered disk reads:  408 MB in  3.00 seconds = 135.88 MB/sec
Here goes the setup on the client side:
root@client:/# cat /etc/default/aoetools
INTERFACES="eth1"
LVMGROUPS=""
AOEMOUNTS=""
root@client:/# /etc/init.d/aoetools restart
Starting AoE devices discovery and mounting AoE filesystems: Nothing to mount.
At this point /dev/etherd was populated and it was time for some tests.
root@client:/# hdparm -tT /dev/etherd/e6.0
/dev/etherd/e6.0:
 Timing cached reads:   3620 MB in  2.00 seconds = 1810.16 MB/sec
 Timing buffered disk reads:  324 MB in  3.01 seconds = 107.63 MB/sec
So .. WOW!
I was not expecting such performance; my hopes were for around 50 MB/s at most. At this point I wondered whether the bottleneck was on the server side, since several of my hdparm invocations on the server showed performance of just around 80 MB/s (probably at times of some server load).
So let's create an in-memory ( and sparse ) file and export it:
root@server:/# dd if=/dev/zero of=6.1 bs=1M count=1 seek=3071
root@server:/# vblade 6 1 eth1 /dev/shm/6.1
The /dev/etherd/e6.1 device was created on the client.
Let's do the tests once again:
root@client:/# hdparm -tT /dev/etherd/e6.1
/dev/etherd/e6.1:
 Timing cached reads:   4006 MB in  2.00 seconds = 2003.68 MB/sec
 Timing buffered disk reads:  336 MB in  3.00 seconds = 111.85 MB/sec
Not too much difference, so I guess I was lucky and hit the top
on my first try.
Let's also try a sequential write test:
root@client:/# dd if=/dev/zero of=/dev/etherd/e6.1 bs=1M count=1024 conv=sync,fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.0311 s, 107 MB/s
At the time of the tests, the maximum network utilization reported by nload on the client was around 890 Mbits (outgoing) and 950 Mbits (incoming). On the server it was 950 Mbits outgoing and 1330 (???) Mbits incoming.
/proc/net/dev on both the server and the client showed no errors or packet drops prior or after the tests.
I'm pleased to say that I'm astonished by the performance results from the isolated tests. A read/write speed of around 110-115 MB/s is more than enough for me, given that the theoretical maximum is around 125 MB/s (before accounting for Ethernet frame overhead). The CPU utilization of the vblade server process was around 50% of one core, which is 1/8 of the available CPU resources; this also sounds pretty good to me. I did not bother measuring the CPU utilization on the client, as the work happens inside the kernel (with no dedicated thread to follow). The tests were performed multiple times to verify the results.
Unfortunately, I started observing decreased write performance with AoE during real-world tests. At first I blamed NILFS, but when I did the tests with EXT4 the problem appeared again. So I first tested the network throughput, which proved to be fine, and then did write tests (dd if=/dev/zero of=/dev/etherd/e6.0) with the AoE device again. This time I observed peaks and falls on the traffic graphs, with bandwidth utilization ranging from 10 to 900 Mbits. Sometimes it started fast, other times it ended fast, but the sustained rate was about 100-120 Mbits. I tried various block sizes and tuning some kernel parameters, with no real improvement. Searching the net showed that others also had write performance issues with AoE. This nice document - http://www.massey.ac.nz/~chmessom/APAC2007.pdf - shows that the most likely cause is the lack of jumbo frame support in the network interfaces I use. On the other hand, it also shows that other protocols (e.g. iSCSI) can perform a lot better with a 1500-byte MTU. So I wonder if the problem is in the AoE protocol or in the software implementation. I cannot easily switch jumbo frames on, and there are not multiple AoE client implementations to compare. I guess it is time to test ggaoed.
Fibre Channel over Ethernet
- http://www.phoronix.com/scan.php?page=article&item=ext4_btrfs_nilfs2&num=1 - NILFS2 Performance Benchmarks
- http://kernelnewbies.org/Linux_2_6_30 - The NILFS2 file system was merged in the 2.6.30 Linux kernel.
root@client:/# mkfs -v -t nilfs2 -L nbd6.0 /dev/etherd/e6.0
FS creation took about 16 minutes for a 500 GB file system (with the above setup) and actually created an ext2 file system! So let's try again:
root@client:/# time mkfs.nilfs2 -L nbd6.0 /dev/etherd/e6.0
mkfs.nilfs2 ver 2.0
Start writing file system initial data to the device
       Blocksize:4096  Device:/dev/etherd/e6.0  Device Size:536870912000
File system initialization succeeded !!

real    0m0.122s
user    0m0.000s
sys     0m0.008s
Well, quite a bit better: about (16 * 60) / 0.122 = 7869 times faster.
root@client:/# mount -t nilfs2 /dev/etherd/e6.0 /mnt/protected/nbd6.0
mount.nilfs2: WARNING! - The NILFS on-disk format may change at any time.
mount.nilfs2: WARNING! - Do not place critical data on a NILFS filesystem.
root@client:/# df | grep etherd
/dev/etherd/e6.0      500G   16M  475G   1% /mnt/protected/nbd6.0
Two things to notice here. First, there is no initial file system overhead of several gigs as with ext2/3; second, the missing 25 gigs are the 5% reserved space (see mkfs.nilfs2).
On the bad side: I tried to fill the file system with data. After the first 70-80 gigs I noticed things were getting pretty slow (network interface utilization of about 50 Mbits) and decided to do FS benchmarks. The throughput I was able to achieve was 5-10 MB/s for sequential writes. Pretty disappointing. I also tried to tune /etc/nilfs_cleanerd.conf by increasing the cleaning_interval from 5 seconds to half an hour and nsegments_per_clean from 2 to 800. Unfortunately, that did not produce any measurable speedup.
I've also observed network utilization of about 30 Mbits in
each direction while the FS was idle. Unmounting it stopped the
traffic; remounting it made it show up again. So I decided that the
cleaner process was doing its business after my "unconsidered"
increase of the parameters. Sadly, the traffic was still there several
hours later.
Additionally, the number of checkpoints was increasing without any file system activity (contrary to the statement in the docs).
I don't need the auto checkpoint feature at all, but the docs did not show me a way to disable it. Doing a manual "mkcp -s" and "rmcp" later will do the job for my needs. I guess this also obsoletes cleanerd for my use case.
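A sketch of that manual workflow with the nilfs-utils tools (the device path matches the setup above; the checkpoint number 1234 is a made-up example):

```shell
# Create a snapshot: a checkpoint that the cleaner will never collect.
mkcp -s /dev/etherd/e6.0

# List checkpoints and snapshots with their checkpoint numbers (CNO).
lscp /dev/etherd/e6.0

# When a snapshot is no longer needed: demote it back to a plain
# checkpoint, then remove it.
chcp cp /dev/etherd/e6.0 1234
rmcp /dev/etherd/e6.0 1234
```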
Anyway. I will try to contact the NILFS maintainers and the
community to see if anyone has a cure.
I could also implement a different solution, e.g. using LVM over the AoE device and its snapshotting feature, but I would really like to give NILFS the chance it deserves.
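For reference, the LVM variant would look roughly like this (VG/LV names, sizes, and the snapshot naming are examples; ext4 stands in for whatever filesystem gets chosen):

```shell
# Put LVM on top of the AoE device and use snapshots as restore points.
pvcreate /dev/etherd/e6.0
vgcreate backupvg /dev/etherd/e6.0
lvcreate --size 400G --name backup backupvg
mkfs -t ext4 /dev/backupvg/backup

# One copy-on-write snapshot per backup run; its size only needs to
# cover the changes made while the snapshot exists.
lvcreate --snapshot --size 10G --name backup-20100110 /dev/backupvg/backup
```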