This page is a work in progress

Backup service

I'm in the process of implementing a backup service with these major requirements:

  • Is external and thus protected against hardware failures of the backed-up servers
  • Provides a time machine (so older versions of files can also be restored)

and the additional wishlist:

  • Backups are fast
  • The backup process is lightweight (the servers are used in production and are loaded around the clock)
  • Service is reliable
  • Implementing it is as simple as possible
  • The interface is universal (e.g. it's better to use a filesystem than a custom solution over dump/restore)

Of course the Perl motto "there is more than one way to do it" applies to the major goals.

E.g. the external part could be done via:

  • some sort of network file system
  • synchronization via a network protocol to a file system living on an external host

and the time machine could be done via

  • incremental backups (e.g. dump/restore)
  • a version control system, with Git being first on my list

My current idea is to use:

  • Software block device over the net ( External )
  • NILFS2 ( Time machine )

So I'm on the hunt for a:


Software block device over the net

Requirements

  • Reliable
  • Fast
  • Simple ( avoid over-complication in implementation, configuration, features, dependencies .. )
  • Supported by Linux
    • Both server and client
    • Strongly preferred to be merged in mainline kernel
    • Strongly preferred tooling to be packaged in Debian

All the protocols listed below should be interchangeable. I might do some benchmarks at a later stage.


iSCSI

Internet SCSI: SCSI commands carried over TCP/IP


NBD

Network Block Device

Implementation

Exporting a device via NBD is a matter of:

root@server:/# apt-get install nbd-server
root@server:/# cat /etc/nbd-server/config
[generic]

[export0]
    exportname = /dev/mapper/vg0-nbd6.0
    port = 99
root@server:/# /etc/init.d/nbd-server restart

And importing it on a client is:

root@client:/# apt-get install nbd-client
root@client:/# grep -v '^#' /etc/nbd-client
AUTO_GEN="n"
KILLALL="true"
NBD_DEVICE[0]=/dev/nbd0
NBD_TYPE[0]=r
NBD_HOST[0]=SERVER-HOSTNAME
NBD_PORT[0]=99
root@client:/# /etc/init.d/nbd-client restart

You might want to check the manual pages in the respective packages for more configuration options and tweaks. E.g. the nbd-client init script can mount file systems automatically.

Benchmarks

By default, nbd-client creates a block device with a block size of 1024 bytes:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 iflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s
1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 14.1721 s, 74.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.6536 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 15.1352 s, 69.3 MB/s
1048576000 bytes (1.0 GB) copied, 15.5831 s, 67.3 MB/s
1048576000 bytes (1.0 GB) copied, 14.3358 s, 73.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.256 s, 68.7 MB/s
1048576000 bytes (1.0 GB) copied, 13.9433 s, 75.2 MB/s
1048576000 bytes (1.0 GB) copied, 13.0245 s, 80.5 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             365.70     32194.80       380.00     321948       3800
sdb             316.20     31760.40       319.20     317604       3192
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             361.80     39333.20       281.20     393332       2812
sdb             323.20     39295.20       260.80     392952       2608
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.20     35762.80       238.40     357628       2384
sdb             274.90     35794.40       201.20     357944       2012

To summarize: throughput is about 70-80 MB/s, and the server reads roughly 100 KB per request. The results are much the same with a 2048-byte block size.
A 4k block size drops the transfer rate to 55 MB/s while keeping about 100 KB per IO operation.
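The "KB per request" figures in this section come from dividing the iostat throughput column by the tps column. A minimal sketch of that arithmetic follows; the function name kb_per_io is made up for illustration, and since tps counts read and write requests together, the result is only an approximation on a server that is not idle.

```shell
#!/usr/bin/env bash
# Average request size in KB, derived from two iostat columns:
#   kB_read/s (or kB_wrtn/s) divided by tps (I/O requests per second).
# kb_per_io is a hypothetical helper name, not part of any package.
kb_per_io() {
    local kb_per_sec=$1 tps=$2
    awk -v kb="$kb_per_sec" -v tps="$tps" 'BEGIN {
        if (tps <= 0) { print "n/a"; exit }
        printf "%.1f\n", kb / tps
    }'
}

# First iostat sample from the direct-read test above (sda, then sdb).
kb_per_io 32194.80 365.70
kb_per_io 31760.40 316.20
```

For that first sample the two disks come out at roughly 88 and 100 KB per request, which is where the "about 100 KB" rounding comes from.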

Let's remove the "direct" flag from dd:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/nbd0 of=/dev/null bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 14.5043 s, 72.3 MB/s
1048576000 bytes (1.0 GB) copied, 18.6863 s, 56.1 MB/s
1048576000 bytes (1.0 GB) copied, 15.6981 s, 66.8 MB/s
1048576000 bytes (1.0 GB) copied, 15.8664 s, 66.1 MB/s
1048576000 bytes (1.0 GB) copied, 16.7602 s, 62.6 MB/s
1048576000 bytes (1.0 GB) copied, 18.382 s, 57.0 MB/s
1048576000 bytes (1.0 GB) copied, 17.1475 s, 61.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.3853 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 19.3907 s, 54.1 MB/s
1048576000 bytes (1.0 GB) copied, 21.7969 s, 48.1 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             312.60     30968.40       173.60     309684       1736
sdb             284.80     30978.00       172.00     309780       1720
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             330.40     32506.40       166.00     325064       1660
sdb             280.60     32517.20       152.00     325172       1520
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             224.40     33598.80        51.60     335988        516
sdb             208.20     33604.40        60.80     336044        608

So this time we have around 60 MB/s with a ratio of about 100 KB per IO operation (note that the server is not totally idle and this is not the only disk activity it sees). With a block size of 2048 bytes this test shows a decreased speed of about 50 MB/s, and the number of IO ops per second doubles. A 4k block size gives us an average of 60 MB/s with 50 KB per IO op.
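The "around N MB/s" figures are eyeballed from ten dd runs each. A small sketch of how the mean could be computed from the same grep output is shown here; mean_dd_rate is a hypothetical helper name, and it assumes dd reports the rate in MB/s as in the transcripts above (lines in other units are skipped).

```shell
#!/usr/bin/env bash
# Read dd summary lines on stdin and print the mean transfer rate in MB/s.
# Only lines ending in "<number> MB/s" are counted.
# mean_dd_rate is a hypothetical helper name.
mean_dd_rate() {
    awk '/ bytes / && $NF == "MB/s" { sum += $(NF - 1); n++ }
         END { if (n) printf "%.2f\n", sum / n; else print "n/a" }'
}

# Example with the first two lines of the direct-read run.
printf '%s\n' \
    '1048576000 bytes (1.0 GB) copied, 12.8387 s, 81.7 MB/s' \
    '1048576000 bytes (1.0 GB) copied, 14.1621 s, 74.0 MB/s' | mean_dd_rate
```

In practice the output of the for loop used in the tests could be piped straight into the function.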

Let's do some write tests:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 oflag=direct 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 10.1818 s, 103 MB/s
1048576000 bytes (1.0 GB) copied, 9.89168 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.73052 s, 108 MB/s
1048576000 bytes (1.0 GB) copied, 9.89912 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 9.91606 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0242 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.95247 s, 105 MB/s
1048576000 bytes (1.0 GB) copied, 9.92473 s, 106 MB/s
1048576000 bytes (1.0 GB) copied, 10.0946 s, 104 MB/s
1048576000 bytes (1.0 GB) copied, 10.1183 s, 104 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             137.80         7.20     51806.80         72     518068
sdb             144.20         1.20     51798.00         12     517980
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             125.70        16.00     52375.20        160     523752
sdb             132.20         4.80     52362.80         48     523628
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             133.20         4.80     52117.60         48     521176
sdb             130.40         5.20     52265.20         52     522652

Write speed is 105 MB/s with about 500 KB per IO operation.
With block sizes of 2k and 4k the results of these tests stay the same.

And let's remove the "direct" flag while writing:

# On the client
blockdev --getbsz /dev/nbd0
1024

for ((i=0; i<10; i++)); do dd if=/dev/zero of=/dev/nbd0 bs=1M count=1000 2>&1 | grep bytes ; done
1048576000 bytes (1.0 GB) copied, 9.34019 s, 112 MB/s
1048576000 bytes (1.0 GB) copied, 15.3738 s, 68.2 MB/s
1048576000 bytes (1.0 GB) copied, 15.6453 s, 67.0 MB/s
1048576000 bytes (1.0 GB) copied, 20.3934 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 20.1742 s, 52.0 MB/s
1048576000 bytes (1.0 GB) copied, 19.0891 s, 54.9 MB/s
1048576000 bytes (1.0 GB) copied, 20.4181 s, 51.4 MB/s
1048576000 bytes (1.0 GB) copied, 16.8115 s, 62.4 MB/s
1048576000 bytes (1.0 GB) copied, 18.3555 s, 57.1 MB/s
1048576000 bytes (1.0 GB) copied, 20.0491 s, 52.3 MB/s

# On the server
iostat -dk 10 | egrep '^(sd|Device)'
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             242.30       667.60     28498.00       6676     284980
sdb             261.80       768.00     26874.40       7680     268744
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             236.70       639.60     28760.00       6396     287600
sdb             247.80       653.20     29739.20       6532     297392
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             257.60       760.00     20544.40       7600     205444
sdb             155.30       356.00     21658.40       3560     216584
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             325.80      1026.40     28021.20      10264     280212
sdb             136.60       238.80     26988.80       2388     269888

We see decreased write speed, around 50-60 MB/s, and once again about 100 KB per IO operation.
The results are pretty much the same with a block size of 2048 bytes.
Increasing the block size to 4k, though, raises the transfer speed to about 100 MB/s and gives a nice 500 KB per IO request.

Next: Summarize the above results in a nice table and test with real files and filesystem

Each cell shows throughput / average request size on the server (KB per IO operation).

Block size   Sequential read     Sequential read + direct   Sequential write     Sequential write + direct
1k           60 MB/s / 100 KB    75 MB/s / 100 KB           55 MB/s / 100 KB     105 MB/s / 500 KB
2k           50 MB/s / 50 KB     75 MB/s / 100 KB           55 MB/s / 100 KB     105 MB/s / 500 KB
4k           50 MB/s / 50 KB     55 MB/s / 100 KB           100 MB/s / 500 KB    105 MB/s / 500 KB

Security

Securing who can access the device is a different story though. The server implementation does not support any authentication. Well, it does support IP-based ACLs, but that is worth little since in most configurations IP addresses can easily be spoofed. I don't see much point in putting such an ACL in the server, as it can be implemented more easily and more reliably in the firewall.

So if you want/need security with NBD you should:

  • On the server: make sure you limit access to the TCP port the server is listening on. E.g. only allow a certain interface (explicitly disallowing the "lo" interface might also be a good idea) and only allow certain IP and/or MAC addresses.
  • On the network: make sure the IP and/or MAC addresses in the server ACL cannot be spoofed. E.g. provide a dedicated wire/VLAN/etc. and/or use managed switches to guarantee the path between the clients and the servers.
  • On the network: if you intend to route NBD traffic via some public network, you might want to add an additional layer of encryption/authentication. IPsec or another tunneling scheme sounds useful.
  • On the client: it might be useful to limit the NBD traffic to a particular UID (most probably root). This is especially important if you have some untrusted apps running.
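The server-side and client-side points above could be expressed as iptables rules along the following lines. This is only a sketch: the interface name, the IP and MAC addresses and the helper names are placeholders, the script prints the rules instead of applying them, and I have not verified how the owner match behaves once the NBD socket has been handed over to the kernel.

```shell
#!/usr/bin/env bash
# Print (not apply) iptables rules restricting access to an NBD export.
# All addresses, the interface and the function names are placeholders.

nbd_server_rules() {
    local iface=$1 client_ip=$2 client_mac=$3 port=$4
    # Accept the one known client, and only on the dedicated interface.
    echo "iptables -A INPUT -i $iface -p tcp --dport $port -s $client_ip -m mac --mac-source $client_mac -j ACCEPT"
    # Drop everything else aimed at the NBD port, loopback included.
    echo "iptables -A INPUT -p tcp --dport $port -j DROP"
}

nbd_client_rules() {
    local server_ip=$1 port=$2
    # Only sockets owned by root (UID 0) may talk to the NBD server.
    echo "iptables -A OUTPUT -p tcp -d $server_ip --dport $port -m owner --uid-owner 0 -j ACCEPT"
    echo "iptables -A OUTPUT -p tcp -d $server_ip --dport $port -j REJECT"
}

nbd_server_rules eth1 192.0.2.10 00:11:22:33:44:55 99
nbd_client_rules 192.0.2.1 99
```

Printing the rules first makes it easy to review them before piping the output to a root shell.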

Notes

nbd-server in Debian testing (as of 100110) does not support SDP (Sockets Direct Protocol), so TCP/IP is used for the tests. SDP is claimed to offer better performance.

I've read somewhere that NBD is not particularly good in case of connection problems.


DST (Obsolete)

DST stands for Distributed STorage

Merged in (recent) 2.6.30 kernel. Update: Unfortunately it was removed as of the 2.6.33 kernel.
As far as I can see from various resources it is implemented as an alternative to NBD and iSCSI.

Its author (Evgeniy Polyakov) looks like a good hacker, and when a good hacker feels that he has to come up with a new implementation there must be something wrong with the old one.

Performance tests done by the DST author show that AoE performs better though, so AoE is probably the first thing that I will try.

DST looks like the second option I will try, as I also plan to implement a similar backup solution in a distributed environment over insecure channels.

Notes:

  • New and probably unstable implementation. As of Linux v2.6.32 it is still in the "staging" area.
  • Native encryption support, so it is usable over insecure channels
  • Both the client&server are implemented in the kernel
  • Single vendor

AoE

ATA over Ethernet

Notes:

  • Multiple vendors for the server implementation
  • The client is implemented in the kernel ( "aoe" module )
  • I will probably try the vblade+aoetools option first as it is already packaged in Debian.
  • GGAOED has Debian build scripts
  • TODO: Post some results from real world tests

AoE works at layer 2 (Data Link, Ethernet) directly, bypassing the processing overhead of the upper layers (IP, TCP/UDP).
This is a candidate for a performance boost but it also has some drawbacks.
E.g. it cannot easily be passed through routers. Even if Ethernet-in-IP tunneling is used, IP fragmentation will likely occur, which will probably slow things down. It looks suitable for use within the data center, where performance is needed and the client and the server are either directly connected or interconnected via a good switch supporting jumbo frames.

Security

The AoE protocol is insecure by design and it is stateless.
So if we want security we should use some additional measures.

Security of the storage

To guarantee the security of the storage we could think of some sort of isolation of the path.
Several options come to my mind:

  • Dedicated Ethernet interfaces and a dedicated wire between the client and the server
  • VLAN isolation
  • MAC filtering on the server and on the switch(es)

The first one is, of course, the most secure (switches could also be penetrated).

MAC filtering can easily be bypassed. If you do the filtering only on the server, then any other host within the network could be reconfigured to become a client.

The path isolation will guarantee that a breach in another host in the same LAN segment will not compromise the storage.

Security of the data

The data security is another topic. Although a man-in-the-middle attack does not look too probable within the data center, you might prefer to be paranoid (or you might simply have a different setup requiring it). In that case you could always add an additional layer of encryption on the client at the cost of more CPU cycles and probably slightly increased latency.

One additional aspect bugged me.
What if a user account on the client host gets compromised? Could it be used to run an AoE client in userspace to gain access to the data?
Thankfully no. Access to the server is done via raw sockets and a dedicated ethertype. Creating raw sockets under Linux requires the CAP_NET_RAW capability, which is usually granted only to root.
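To check whether a given process actually holds that capability, one can look at the CapEff mask in /proc. The sketch below assumes a Linux /proc file system and uses the fact that CAP_NET_RAW is capability number 13; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# CAP_NET_RAW is capability number 13 in <linux/capability.h>.
CAP_NET_RAW_BIT=13

# Succeeds if the given hexadecimal capability mask has CAP_NET_RAW set.
# An empty argument is treated as an all-zero mask.
mask_has_net_raw() {
    local mask_hex=${1:-0}
    (( (16#$mask_hex >> CAP_NET_RAW_BIT) & 1 ))
}

# Effective capability mask of the calling shell's children,
# in the format used by /proc/<pid>/status.
current_cap_mask() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

if mask_has_net_raw "$(current_cap_mask)"; then
    echo "this process may open raw sockets (CAP_NET_RAW present)"
else
    echo "this process may not open raw sockets"
fi
```

Run as an ordinary user this should normally report that raw sockets are not available, which is the property the paragraph above relies on.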

Implementation

Both machines are Dell PowerEdge R200:

  • 1U
  • 1 Intel Xeon CPU X3320 @ 2.50GHz with 4 cores
  • 4 GB of memory.
  • Debian GNU/Linux testing/Squeeze
  • 2.6.30-2-686-bigmem kernel package
  • 2 x Broadcom NetXtreme BCM5721 ( 1Gbit, No jumbo frame support )
  • 2 HDDs each of them being:

    Model Family: Seagate Barracuda ES.2 Device Model: ST3750330NS
    Firmware Version: SN05
    User Capacity: 750,156,374,016 bytes

The servers are connected via a dedicated wire.

The network interfaces are at:

root@client:/# ethtool eth1
Settings for eth1:
    Speed: 1000Mb/s
    Duplex: Full
    Port: Twisted Pair
    Auto-negotiation: on
    Link detected: yes

Neither system was completely idle during the tests.

Here is how the block device is exported:

root@server:/# lvcreate --verbose --size 500G --name nbd6.0 VGNAME /dev/md8 /dev/md9
root@server:/# vblade 6 0 eth1 /dev/VGNAME/nbd6.0 2>&1

md8 is a software RAID0 (striping) over 2 x 150 GB partitions at the end of the HDDs. So is md9. Two physical HDDs are used in total. The software RAID is added for performance. The partitioning is done for easier relocation of parts of the space.
Placing the partitions at the end of the drive gives a roughly 1.5x to 2x performance penalty for sequential operations. This is due to the circular design of the Winchester hard drives. Inner tracks have a smaller radius and thus length, so outer tracks offer more storage points and are divided into more sectors. So for each revolution more sectors are read from the outer tracks.

The performance I was able to get from this raid on the server looks like:

root@server:/# hdparm -tT /dev/VGNAME/nbd6.0                            
/dev/VGNAME/nbd6.0:
 Timing cached reads:   4146 MB in  2.00 seconds = 2073.21 MB/sec
 Timing buffered disk reads:  408 MB in  3.00 seconds = 135.88 MB/sec

Here goes the setup on the client side:

root@client:/# cat /etc/default/aoetools
INTERFACES="eth1"
LVMGROUPS=""
AOEMOUNTS=""

root@client:/# /etc/init.d/aoetools restart
Starting AoE devices discovery and mounting AoE filesystems: Nothing to mount.

At this point /dev/etherd was populated and it was time for some tests.

root@client:/# hdparm -tT /dev/etherd/e6.0
/dev/etherd/e6.0:
 Timing cached reads:   3620 MB in  2.00 seconds = 1810.16 MB/sec
 Timing buffered disk reads:  324 MB in  3.01 seconds = 107.63 MB/sec

So .. WOW !
I was not expecting such performance. My hopes were around 50 MB/s at most. At this point I wondered whether the bottleneck was actually on the server side, since several of my hdparm invocations on the server showed just around 80 MB/s (probably at times of some server load).

So let's create an in-memory ( and sparse ) file and export it:

root@server:/# dd if=/dev/zero of=/dev/shm/6.1 bs=1M count=1 seek=3071
root@server:/# vblade 6 1 eth1 /dev/shm/6.1
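The dd invocation above seeks 3071 blocks of 1 MiB and writes a single one, so the file appears to be 3 GiB while only about 1 MiB is allocated. A scaled-down sketch of the same trick, assuming GNU dd and GNU stat, shows the difference between apparent and allocated size; the temporary file and the variable names are just for illustration.

```shell
#!/usr/bin/env bash
# Create a small sparse file the same way as above: seek past the first
# 15 blocks of 1 MiB, then write a single 1 MiB block of zeros.
demo_file=$(mktemp)
dd if=/dev/zero of="$demo_file" bs=1M count=1 seek=15 2>/dev/null

# Logical size of the file in bytes.
apparent_bytes=$(stat -c %s "$demo_file")
# Space actually allocated: block count times the block unit stat reports.
allocated_bytes=$(( $(stat -c '%b * %B' "$demo_file") ))

echo "apparent: $apparent_bytes bytes, allocated: $allocated_bytes bytes"
rm -f "$demo_file"
```

On a file system that supports sparse files the allocated size should be well below the apparent 16 MiB.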

The /dev/etherd/e6.1 device was created on the client automagically.
Let's do the tests once again:

root@client:/# hdparm -tT /dev/etherd/e6.1
/dev/etherd/e6.1:
 Timing cached reads:   4006 MB in  2.00 seconds = 2003.68 MB/sec
 Timing buffered disk reads:  336 MB in  3.00 seconds = 111.85 MB/sec

Not too much difference, so I guess I was lucky and hit the top at my first try.
Let's also try a sequential write test:

root@client:/# dd if=/dev/zero of=/dev/etherd/e6.1 bs=1M count=1024 conv=sync,fsync
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 10.0311 s, 107 MB/s

At the time of the tests the maximum network utilization reported by nload on the client was around 890 Mbit/s outgoing and 950 Mbit/s incoming. On the server it was 950 Mbit/s outgoing and 1330 (???) Mbit/s incoming.

/proc/net/dev on both the server and the client showed no errors or packet drops before or after the tests.

I'm pleased to say that I'm astonished by the performance results from the isolated tests. A read/write speed of around 110-115 MB/s is more than enough for me when the theoretical maximum is around 125 MB/s (before subtracting Ethernet frame overhead). The CPU utilization of the vblade server process was around 50% of one core, which is 1/8 of the available CPU resources. This also sounds pretty good to me. I did not bother measuring the CPU utilization on the client, as the work happens inside the kernel (with no dedicated thread to follow). The tests were performed multiple times to verify the results.
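To put a number on the Ethernet frame overhead mentioned above, here is a rough sketch. It counts 38 bytes per frame on the wire besides the payload (preamble and start delimiter, Ethernet header, frame check sequence, inter-frame gap) and ignores VLAN tags. The function name max_payload_rate is hypothetical. AoE adds its own headers inside the payload and transfers whole 512-byte sectors, so the usable figure is lower still.

```shell
#!/usr/bin/env bash
# Upper bound on Ethernet payload throughput in MB/s for a line rate
# given in Mbit/s and an MTU given in bytes.
# Per-frame cost on the wire besides the payload:
#   preamble + SFD 8, Ethernet header 14, FCS 4, inter-frame gap 12 = 38 bytes
# max_payload_rate is a hypothetical helper name.
max_payload_rate() {
    local line_rate_mbit=$1 mtu=$2
    awk -v rate="$line_rate_mbit" -v mtu="$mtu" 'BEGIN {
        overhead = 8 + 14 + 4 + 12
        printf "%.1f\n", (rate / 8) * mtu / (mtu + overhead)
    }'
}

max_payload_rate 1000 1500   # standard frames
max_payload_rate 1000 9000   # jumbo frames
```

With standard 1500-byte frames this gives about 121.9 MB/s, and about 124.5 MB/s with 9000-byte jumbo frames, so the measured rates are already close to the ceiling.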

Unfortunately, I started observing decreased write performance with AoE during real-world tests. At first I blamed NILFS, but when I repeated the tests with ext4 the problem appeared again. So I first tested the network throughput, which proved to be fine, and then repeated the write tests ( dd if=/dev/zero of=/dev/etherd/e6.0 ) against the AoE device. This time I observed peaks and drops on the traffic graphs, with bandwidth utilization ranging from 10 to 900 Mbit/s. Sometimes it started fast, other times it ended fast, but the sustained rate was about 100-120 Mbit/s. I tried various block sizes and tuned some kernel parameters with no real improvement.

Searching the net showed that others have also had write performance issues with AoE. This nice document, http://www.massey.ac.nz/~chmessom/APAC2007.pdf, shows that the most likely cause is the lack of jumbo frame support in the network interfaces that I use. On the other hand, it also shows that other protocols (e.g. iSCSI) can perform a lot better with a 1500-byte MTU. So I wonder whether the problem is in the AoE protocol or in the software implementation. I cannot easily switch jumbo frames on, and there is only one AoE client implementation. I guess it is time to test ggaoed.


FCoE

Fibre Channel over Ethernet

etc.


NILFS2

Implementation

root@client:/# mkfs -v -t nilfs2 -L nbd6.0 /dev/etherd/e6.0

FS creation took about 16 minutes for a 500 GB file system (with the above setup) and actually created an ext2 file system !!! So let's try again:

root@client:/# time mkfs.nilfs2 -L nbd6.0 /dev/etherd/e6.0
mkfs.nilfs2 ver 2.0
Start writing file system initial data to the device
   Blocksize:4096  Device:/dev/etherd/e6.0  Device Size:536870912000
File system initialization succeeded !!

real    0m0.122s
user    0m0.000s
sys     0m0.008s

Well, much better: about (16 * 60) / 0.122 = 7869 times faster.

root@client:/# mount -t nilfs2 /dev/etherd/e6.0 /mnt/protected/nbd6.0
mount.nilfs2: WARNING! - The NILFS on-disk format may change at any time.
mount.nilfs2: WARNING! - Do not place critical data on a NILFS filesystem.
root@client:/# df | grep etherd
/dev/etherd/e6.0      500G   16M  475G   1% /mnt/protected/nbd6.0

Two things to notice here. First, there is no initial file system overhead of several gigs as with ext2/3. Second, the missing 25 gigs are the 5% reserved space (see mkfs.nilfs2).

On the bad side: I tried to fill the file system with data. After the first 70-80 gigs I noticed that things were pretty slow (network interface utilization of about 50 Mbit/s) and decided to do FS benchmarks. The throughput I was able to achieve was 5-10 MB/s for sequential writes. Pretty disappointing. I also tried to tune /etc/nilfs_cleanerd.conf by increasing cleaning_interval from 5 seconds to half an hour and nsegments_per_clean from 2 to 800. Unfortunately it did not produce any measurable speedup.

I also observed a network utilization of about 30 Mbit/s in each direction while the FS was idle. Unmounting it stopped the traffic. Remounting it made the traffic appear again. So I decided that the cleaner process was doing its business after my "unconsidered" over-increase of the parameters. Sadly the traffic was still there several hours later.
Additionally, the checkpoint number kept increasing without any file system activity (contrary to the statement in the docs).
I don't need the auto checkpoint feature at all, but the docs did not show me a way to disable it. Doing a manual "mkcp -s" and "rmcp" later will do the job for my needs. I guess this also obsoletes cleanerd for my use case.
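The manual cycle I have in mind would look roughly like the sketch below. It only prints the nilfs-utils commands instead of running them; the checkpoint number 42, the restore directory and the function name are placeholders, and the exact options should be checked against the mkcp, lscp, chcp and rmcp manual pages of the installed version.

```shell
#!/usr/bin/env bash
# Print the nilfs-utils commands for one manual snapshot cycle.
# Nothing is executed; the device, checkpoint number and mount point
# passed in below are placeholders, and the function name is made up.
nilfs_snapshot_plan() {
    local device=$1 old_cno=$2 restore_dir=$3
    # Create a new checkpoint and mark it as a snapshot.
    echo "mkcp -s $device"
    # List the existing snapshots together with their checkpoint numbers.
    echo "lscp -s $device"
    # Mount an older snapshot read-only to restore files from it.
    echo "mount -t nilfs2 -r -o cp=$old_cno $device $restore_dir"
    echo "umount $restore_dir"
    # A snapshot has to be turned back into a plain checkpoint
    # before it can be removed.
    echo "chcp cp $device $old_cno"
    echo "rmcp $device $old_cno"
}

nilfs_snapshot_plan /dev/etherd/e6.0 42 /mnt/restore
```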

Anyway, I will try to contact the NILFS maintainers and the community to see if anyone has a cure.
I could also implement a different solution, e.g. using LVM over the AoE device and the LVM snapshot feature, but I would really like to give NILFS the chance it deserves.