Du, du hast, du hast mich df

The original was posted here, but deduplication erased it: http://www.c0t0d0s0.org/archives/6168-df-considered-problematic.html

I could summarize this article in a short sentence: don't use df. But I want to explain this to you in a slightly more detailed manner. It's not that df -k delivers wrong data. But df is based on a number of assumptions that are wrong in conjunction with data services like deduplication or compression. So you have to interpret the data differently.

The more interesting question at this point is: “Do your scripts interpret the numbers correctly?” I know there are countless scripts out at customer sites monitoring disk space with df, and they may need a rethink when used with modern filesystems like ZFS. So: what's the problem with df, and what are the alternatives?
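
A typical monitoring check built on df looks something like the following sketch (a generic, hypothetical example, not taken from the article, and assuming df's output for the filesystem fits on one line); every assumption baked into it is challenged below:

# df -k /testpool1 | awk 'NR==2 { sub(/%/, "", $5); if ($5+0 > 90) print "ALERT: " $6 " is " $5 "% full" }'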

Magical harddisk enlargement

Let's just assume you have a ZFS system with deduplication:

# zpool create testpool1 /export/home/jmoekamp/test1 /export/home/jmoekamp/test2 
# zfs set dedup=sha256 testpool1
# zfs set checksum=sha256 testpool1
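
(The two file vdevs used above have to exist before the zpool create; the article doesn't show that step. On Solaris, something like the following would create them. The 128 MByte size is my assumption, chosen to roughly match the 246 MByte pool shown below.)

# mkfile 128m /export/home/jmoekamp/test1
# mkfile 128m /export/home/jmoekamp/test2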

I've created a ZFS pool with files as its backing store, just to demonstrate the effect. At first we have an empty ZFS filesystem.

# zpool list testpool1
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool1   246M   142K   246M     0%  1.00x  ONLINE  -
# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1             219136      21  219062     1%    /testpool1

Now we copy a file into our newly created filesystem. I've chosen the wireshark binary; you can use any other file for this demonstration, it just should have a significant size. Showing the problem with small files is a bit more tedious, as you have to create more of them.

# cp /usr/sbin/wireshark /testpool1/wireshark-1
# zpool list testpool1
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool1   246M  2.29M   244M     0%  1.00x  ONLINE  -
# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1             219136    2201  216860     2%    /testpool1

We've copied a single file into the pool. Our disk is 219136 KByte in size. The file takes roughly 2201 KByte. We still have 216860 KByte available. Looks consistent. Okay … let's copy the wireshark file a second time into the filesystem.

# cp /usr/sbin/wireshark /testpool1/wireshark-2
# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1             221312    4380  216854     2%    /testpool1

Deduplication just works as designed. The second file didn't take roughly 2.2 MBytes, it just took 6 KByte: the data was deduplicated by ZFS. But look at the used column: 4380 KByte used. Obviously the amount of data is counted twice. This is correct, too, because from a filesystem perspective there are two files of 2.2 MBytes each, not one. If you separated the files onto separate pools or systems, you would have two times the 2.2 MBytes. But now it's getting interesting. Look at the second column, the “kbytes” column: this is the total size of the filesystem (free plus used space). Normally the total size of a filesystem doesn't change, but after copying a file into it we now have a bigger filesystem. Its size increased by 2176 KByte (221312 minus 219136), which is, interestingly, just about the size of the file deduplicated by ZFS.

But that's not unreasonable either. Let's assume we have a disk with three blocks. I write a single block to it; I have two blocks left. Now I write the same block to the disk again. With a normal filesystem that would use another block: you would have one block left and two blocks on disk. A deduplicating filesystem also reports two blocks on disk, but it still has two free blocks. The only way to make sense of this situation is to increase the reported size of the filesystem, thus showing a filesystem with a total size of four blocks.

That has a certain impact on the monitoring of the system. Using the kbytes column isn't reasonable, as it's a moving target, and used doesn't factor in deduplication. Nevertheless there is a way to get more reasonable data about the space consumption in your pool: zpool list.

# zpool list testpool1
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool1   246M  2.31M   244M     0%  2.00x  ONLINE  -

This tool correctly shows the size, the allocated space and the free space.

Misguiding percentages

I haven't talked about the capacity column so far. I will now demonstrate how looking at the capacity column gives you a totally wrong impression. I wrote a small script that simply copies a file again and again, with a number appended to the filename (a sketch of such a loop follows the output below). Of course this is the classic use case for deduplication. Let's look at the pool after a hundred copies:

# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1             438912  222297  216535    51%    /testpool1
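
For reference, the copy script itself isn't shown in the article; a minimal Bourne-shell sketch of such a loop could look like this (paths and naming scheme taken from the listings above, iteration count chosen as 100):

# i=1; while [ $i -le 100 ]; do cp /usr/sbin/wireshark /testpool1/wireshark-$i; i=`expr $i + 1`; done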

When you look at the capacity column you would assume that your pool is 51% filled. But we just copied the same file again and again. Was there a problem with the deduplication?

No, of course not. When you look at the avail column, you will see that despite copying a 2.2 MByte file a hundred times, we just used an additional 319 KByte (216854 minus 216535) instead of roughly 220 MBytes. But from the perspective of the filesystem it's filled with 222297 KBytes worth of data.

But: instead of a 219136 KByte pool we now have a 438912 KByte pool. So we neither have an almost empty disk, as the block allocation by ZFS would suggest, nor a full disk, as a quick calculation of the original size minus the size of the files copied into the filesystem would suggest (or, to be exact, a disk with a negative number of available blocks, as 219136 KBytes minus 222297 KBytes is -3161 KBytes). “Empty” would be totally wrong from the filesystem perspective. So adding used to avail to get the size of the disk (and calculating the capacity column from there) is a reasonable design choice. BTW: I'm sure you are already aware of the explanation for the 51% capacity: 222297 used out of a 438912 KByte total is roughly 50.6%, which df rounds to 51%.

Again, zpool list is a much better tool to get some insight into your system.

# zpool list testpool1
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool1   246M  2.62M   243M     1%  102.00x  ONLINE  -

The 100 files took just 2.62 MBytes after deduplication. The size of the pool is still the real one, and the capacity calculation is more reasonable. Now let's push this situation to the extreme: I've restarted the script to generate 1000 copies of the wireshark file.

# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1            2606310 2401486  204728    93%    /testpool1

Well … 2401486 KBytes in a pool that was 219136 KBytes initially. Due to the deduplication we've just used 11807 KBytes for this amount of data. An additional 11 MBytes in a pool of 246 MBytes gives you a capacity usage of 93% instead of 51%, due to the way these numbers are calculated. Again, zpool list gives you much more reasonable numbers.

# zpool list testpool1
NAME        SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
testpool1   246M  5.69M   240M     2%  1102.00x  ONLINE  -

Digging in the source

But why does the system behave this way? The reason is in the source code of df.c: there is a function called adjust_total_blocks, starting at line 1224. The comment describes the basic problem:

   1215 /*
   1216  * The statvfs() implementation allows us to return only two values, the total
   1217  * number of blocks and the number of blocks free.  The equation 'used = total -
   1218  * free' will not work for ZFS filesystems, due to the nature of pooled storage.
   1219  * We choose to return values in the statvfs structure that will produce correct
   1220  * results for 'used' and 'available', but not 'total'.  This function will open
   1221  * the underlying ZFS dataset if necessary and get the real value.
   1222  */

When you look at line 1271, you will recognize how this total value is calculated:

size = _zfs_prop_get_int(zhp, ZFS_PROP_USED) + _zfs_prop_get_int(zhp, ZFS_PROP_AVAILABLE);

You can query those properties manually:

# zfs get -p available testpool1
NAME       PROPERTY   VALUE  SOURCE
testpool1  available  209080832  -
# zfs get -p used testpool1
NAME       PROPERTY  VALUE  SOURCE
testpool1  used      2459780608  -
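
You can reproduce the addition df performs with a quick awk one-liner (the one-liner is mine; the two values are the ones from the zfs get output above):

# echo "209080832 2459780608" | awk '{ print ($1 + $2) / 1024 }'
2606310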

So the total size is 209080832 bytes plus 2459780608 bytes = 2668861440 bytes, or 2606310 KBytes. Now remember the last df -k output:

# df -k /testpool1
Filesystem            kbytes    used   avail capacity  Mounted on
testpool1            2606310 2401486  204728    93%    /testpool1

Now look at line 1441 of df.c:

   1441 		(void) sprintf(capacity_buf, "%5.0f%%",
   1442 		    (total_blocks == 0) ? 0.0 :
   1443 		    ((double)used_blocks /
   1444 		    (double)(total_blocks - reserved_blocks))
   1445 		      * 100.0 + 0.5);

Let's compute this manually: 2459780608 / 2668861440 * 100 + 0.5 results in roughly 92.7. XCU and POSIX.2 mandate rounding to the next integer: 93%. So you see, the output of df is perfectly correct. As I stated in the beginning, there are some assumptions in df, and one of them is that the value used refers to the physically used capacity on the media. With deduplication this is a wrong assumption, as the physically used capacity may be less than the capacity shown by the filesystem (or let's say: the logically used capacity).
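
If you want to verify this on the command line, the same formula can be fed to awk (my sketch; it ignores reserved_blocks, just like the manual computation above):

# echo "2459780608 2668861440" | awk '{ printf("%5.0f%%\n", $1 / $2 * 100 + 0.5) }'
   93%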

Conclusion

I hope I was able to give you an example of why you shouldn't use df with ZFS, or, when you do use it with ZFS, why you should know the informative value of these numbers. In most cases, zpool list gives you the numbers that you really want.

I know, some people will say: “Modify df!” But can you really do that? df looks at the data from the filesystem perspective. But to which filesystem should a block be counted when it is referenced by multiple filesystems in a single pool? Difficult question. The most reasonable answer is: to any of the filesystems. A flattened (reduplicated) version of any deduplicated filesystem would really use the amount of data designated by ZFS_PROP_USED. The same is valid for deduplication within a single filesystem: the latent size of a filesystem is the flattened, reduplicated size, not the deduplicated one. From my point of view, the numbers presented by df are perfectly correct within the design goals of df, but they are incorrect in the way many people use df.

zpool list looks from the pool perspective and doesn't have all these problems posed by the filesystem perspective. It simply counts allocated blocks when it tells you the allocated space.
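
If your monitoring scripts currently parse df, switching them to the pool perspective is straightforward. A hypothetical, script-friendly variant (the -H and -o options exist in dedup-capable releases; check your zpool(1M) and zfs(1M) man pages for the exact property names on your build):

# zpool list -H -o name,size,allocated,free,capacity,dedupratio testpool1
# zfs get -Hp -o value used,available testpool1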

By the way, you could say this is an absolutely unrealistic use case. Really? Consider that you store Windows desktop images on your fileserver for use with your favourite virtualisation tool. You have a thousand desktops and all of them are fairly similar (all running Windows 7, for example). Then you have a vast number of duplicates, which would be deduplicated by ZFS. It's pretty much the same as with the wireshark file, just with much larger files :-)
