The Design and Implementation of a Log-Structured File System

Date: October 10, 2007

 

 

(1)

1.What is the problem the authors are trying to solve?

When writing data to the log, the system must ensure that large amounts of free space are always available for the log to grow into. This is the central problem a log-structured file system faces, so the authors propose the concept of a segment cleaner process to guarantee that the system can continually produce empty segments.

2.What other approaches or solutions existed at the time that this work was done?

FFS (the Berkeley Unix Fast File System)

3.What was wrong with the other approaches or solutions?

Existing file systems have two problems. First, they spread information across the whole disk, causing too many small accesses. Second, when writing data synchronously, the application must wait for the previous write to finish before it can continue, instead of letting the write proceed in the background.

4.What is the authors' approach or solution?

A log-structured file system works by writing data to disk sequentially; this structure is simply called the log. Write speed is one of the factors that dominate disk traffic, so if the time the disk head spends seeking can be eliminated, write performance increases.
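The sequential-write idea described above can be sketched as a toy append-only log. All names here (`ToyLog`, `append`) are illustrative assumptions, not Sprite LFS code:

```python
# Toy sketch of the core LFS idea: all updates, data and metadata alike,
# are appended at the log head in one sequential write, so the disk head
# never seeks between small updates.

class ToyLog:
    def __init__(self):
        self.disk = []        # the disk, modeled as a flat list of blocks
        self.head = 0         # next free block index (the log head)

    def append(self, blocks):
        """Write a batch of blocks with a single sequential transfer."""
        start = self.head
        self.disk.extend(blocks)      # one contiguous write, no seeks
        self.head += len(blocks)
        return start                  # disk address of the first block

log = ToyLog()
addr = log.append(["inode#7", "data-a", "data-b"])
print(addr, log.head)   # 0 3
```

The point of the sketch is that a second `append` starts exactly where the first ended, so consecutive writes never pay a seek between them.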

5.Why is it better than the other approaches or solutions?

The log concept in this paper is not being proposed for the first time; earlier file systems already used a log structure to speed up disk writes, but they treated the log only as temporary storage, since the data a disk holds is normally expected to be permanent. The authors therefore propose a log-structured system that stores data permanently in the log itself, needing no other structure on disk, and the log also carries indexing ability. Compared with the other log-based file systems, the log-structured system is naturally superior.


6.How does it perform?

To show how it performs, the authors built a prototype log-structured file system called Sprite LFS and compared its performance against FFS.

7.Why is this work important?

We all know that disk performance lags far behind CPU performance. To keep the gap between them from becoming the system's performance bottleneck, developing ways to improve disk performance is very important to future computer development.

8.Can any improvement be done?

No

 

 

(2)

1.What is the problem the authors are trying to solve?
Ans: CPU speed has been improving exponentially; by contrast, disk performance improvement faces a severe bottleneck. Disk improvements have come mainly in capacity and price, with little progress in speed, chiefly because of the disk's physical limits: its mechanical design leaves very little room for faster access. This huge gap between the two will leave overall computer performance bound by disk access speed (disk bound). Although the disk's physical performance can improve only so much, performance can still be raised through file-access mechanisms and data-structure design.

2.What other approaches or solutions existed at the time that this work was done?
Ans: The Berkeley Unix Fast File System. Unix FFS uses a data structure called an i-node, kept independent of the file's ordinary contents; it is an entry in an index containing attributes such as the file's type, owner, and permissions. The i-node records the disk addresses of the file's blocks as well as of indirect blocks, and an indirect block points to further data-block addresses, so large amounts of file data can be mapped. I-nodes are stored at fixed (and non-contiguous) positions on the disk, so a file's location on disk can be found with a simple calculation.
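The lookup path described above can be sketched minimally. The structures below are hypothetical simplifications (a real FFS inode has 12 direct pointers plus single, double, and triple indirect blocks; only single indirection is shown):

```python
# Minimal sketch of how an FFS-style inode maps a file's logical block
# number to a disk address: small files use direct pointers, larger ones
# go through an indirect block. Structures are illustrative, not real FFS.

NDIRECT = 12            # direct pointers per inode (FFS uses 12)

def block_address(inode, logical_block, read_block):
    """Return the disk address of a file's logical block.

    inode: dict with 'direct' (list) and 'indirect' (disk addr or None)
    read_block: function that reads one disk block (a list of addresses)
    """
    if logical_block < NDIRECT:
        return inode["direct"][logical_block]
    # Beyond the direct pointers, read the single indirect block, which
    # holds a list of further data-block addresses.
    indirect = read_block(inode["indirect"])
    return indirect[logical_block - NDIRECT]

# Usage: a fake disk where block 100 is an indirect block.
disk = {100: [500, 501, 502]}
ino = {"direct": list(range(12)), "indirect": 100}
print(block_address(ino, 3, disk.__getitem__))    # 3 (direct)
print(block_address(ino, 13, disk.__getitem__))   # 501 (via indirect)
```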


3.What was wrong with the other approaches or solutions?
Ans: The Unix FFS approach has two drawbacks. First, when accessing files it uses only 5% of the disk's potential bandwidth; the rest of the time is spent seeking, which is very wasteful of bandwidth. Second, although Unix FFS can write file data asynchronously, the metadata in its file structures (i-nodes, directories) must be updated synchronously. Under workloads with many small files, this hurts performance and undermines things like the write buffer.


4.What is the authors' approach or solution?
Ans: They use a technique called a log-structured file system, which is the only data structure on the disk. Since memory capacities keep growing, files can often live in memory, so file reads become more and more efficient; writing file data is not like this, so performance comes to be determined by write performance. A log-structured file system writes new information into a sequential structure called the log: to write data, only one seek is needed, after which the data can be written in a single pass, saving a great deal of time. For reads, the log-structured file system contains indexing information, which makes bringing file data from disk back into memory more efficient. Because the log is a sequential structure, writes usually need large, contiguous storage space, so the disk is divided into segments, together with a policy called the segment cleaner that produces idle space to store file data.
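The segment cleaner's job, copying live blocks out of dirty segments so those segments become wholly free, can be sketched as follows; the data model is an illustrative assumption, not the Sprite LFS implementation:

```python
# Sketch of segment cleaning: read candidate segments, keep only blocks
# that are still live, and append them compacted to a clean segment so
# the old segments become entirely free.

def clean(segments, is_live, write_segment):
    """Compact the live blocks out of dirty segments.

    segments: list of segments, each a list of blocks
    is_live: predicate telling whether a block is still current
    write_segment: sink that receives the compacted live blocks
    """
    live = [b for seg in segments for b in seg if is_live(b)]
    write_segment(live)          # one sequential write of the survivors
    return len(segments)         # number of segments now free

out = []
freed = clean([["a", "b"], ["c", "d"]],
              is_live=lambda b: b in {"a", "d"},
              write_segment=out.append)
print(freed, out)    # 2 [['a', 'd']]
```

Note the trade: two partially dead segments cost one write of the survivors but yield two clean segments for future sequential writes.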

5.Why is it better than the other approaches or solutions?
Ans: In the measurements, Sprite LFS (a prototype log-structured file system) beats the traditional Unix file system when writing large numbers of small files. Even for larger files, and for reads, Sprite LFS is no worse than the traditional Unix system (the one exception being the case of reading a file sequentially after writing it randomly). When writing larger files it is sometimes even better, because there is no segment-cleaning overhead. When writing new data, the traditional Unix system uses only 5-10% of the disk bandwidth, while Sprite LFS uses 65-75%. For crash recovery, traditional Unix scans the entire disk before restoring; Sprite LFS only needs to examine the most recently written portion of the log after a crash before it can recover.

6.How does it perform?
Ans: Sprite LFS divides the disk into large fixed-size extents, calling each one a segment, and each segment must be written sequentially. To know the state of file data, an inode map is used to record the location of every file. Sometimes the file data to be written is very large, or the data is too scattered; then segment cleaning must be used to string free segments together into contiguous free space.


7.Why is this work important?
Ans: A log-structured file system performs better than most file systems today, especially when handling large numbers of small files, and its crash recovery finishes quickly. With CPUs now clearly faster than disks, a log-structured file system minimizes the degree to which performance is affected by the disk.


8.Can any improvement be done?

 

 

(3)

Q1.What is the problem the authors are trying to solve?
A: CPU speed has increased rapidly, but disk access times have improved slowly.
Current file systems use the disk's bandwidth poorly.
The authors designed a new disk storage management technique to use the disk more efficiently.

Q2.What other approaches or solutions existed at the time that this work was done?
A: The Unix Fast File System was effective at laying out each file sequentially on disk.

Q3.What was wrong with the other approaches or solutions?
A: Less than 5% of the disk's bandwidth is used for writing new file data; almost all the time is spent seeking. Moreover, the file system writes data synchronously, so applications have to wait for each write to finish.

Q4.What is the authors' approach or solution?
A: The authors designed a new file system called a log-structured file system (LFS).
LFS writes new information to disk in a sequential structure called the "log".
LFS buffers a sequence of file system changes in the cache and then writes all the changes to disk sequentially in a single disk operation to improve write performance.

Q5.Why is it better than the other approaches or solutions?
A: The log has indexing information, so files can be read back efficiently.
After a crash, LFS only checks the most recent portion of the log.

Q6.How does it perform?
A: The authors used benchmark programs and long-term measurements of cleaning overhead to evaluate Sprite LFS.

Q7.Why is this work important?
A: When a system crash occurs, a traditional Unix file system has no log, so it cannot determine where the last file changes were made; the system must scan all of the metadata on disk to restore consistency.
LFS finds the last disk accesses at the end of the log, so it recovers the system quickly.

Q8.Can any improvement be done?
A: No.

 

 

(4)

1. What is the problem the authors are trying to solve?
The authors try to solve the problem that accessing small files costs too much access time, and that when the system crashes, the OS needs a lot of time to check the whole disk.

2. What other approaches or solutions existed at the time that this work was done?
The Berkeley Unix Fast File System (FFS).

3. What was wrong with the other approaches or solutions?
FFS is slow when reading small files, because after reading an inode the data blocks are scattered all over the disk, so it takes many seeks to reach all the data blocks.

4. What is the authors' approach or solution?
The authors' approach is to put the inode, data blocks, and directory together in the log, so reading a small file takes only one large sequential read, unlike FFS, which needs 5 to 10 separate reads and block accesses.

The write buffer allows asynchronous writes, which means writing a lot of data at once, unlike FFS, which writes data piece by piece.
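The write-buffer behavior described above can be sketched like this; the class name and flush threshold are illustrative assumptions:

```python
# Sketch of the write-buffering idea: small writes accumulate in memory
# and are flushed to disk as one large sequential transfer once enough
# has gathered.

class WriteBuffer:
    def __init__(self, flush_at, flush):
        self.pending = []
        self.flush_at = flush_at   # bytes to gather before writing
        self.flush = flush         # stands in for one sequential disk write

    def write(self, data):
        self.pending.append(data)
        if sum(len(d) for d in self.pending) >= self.flush_at:
            self.flush(b"".join(self.pending))   # single large write
            self.pending = []

writes = []
buf = WriteBuffer(flush_at=8, flush=writes.append)
for chunk in (b"abc", b"de", b"fgh"):   # three small application writes
    buf.write(chunk)
print(writes)    # [b'abcdefgh'] -- one disk write instead of three
```

The cost of this batching, as the LFS paper acknowledges, is that buffered data not yet flushed is lost in a crash.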



5. Why is it better than the other approaches or solutions?
Because LFS saves disk access time when reading small files, which FFS cannot do.


6. How does it perform?
The main idea of the authors' LFS is to keep the inode, inode data, and directory stored together, so reading a file takes only one read.
The disk is also reorganized regularly, gathering scattered blocks together; this frees a lot of space, so large data can be stored.


Data access:
To read data, first look up the file id, learn the inode map location from the log, and then obtain the inode address.


Free-space management:
Every few tens of seconds, threading or copying is performed.
Threading advances the log end without changing block contents; this method suits long-lived data that rarely changes.

Copying advances the log end and gathers the blocks toward one side, freeing up many blocks, so that next time a large file needs to be stored, space is easier to find; this method suits data that changes often.




7. Why is this work important?
This work is important because LFS improves disk bandwidth utilization when accessing small files, and when the system crashes, the log can be used to recover files.

8. Can any improvement be done?
No

 

 

(5)

Q1. What is the problem the authors are trying to solve?
A:
The authors want to design a file system that increases disk bandwidth when writing new data and solves the problem of disk traffic being dominated by metadata when files are small.


Q2. What other approaches or solutions existed at the time that this work was
done?
A: Related work includes the following:
1.A crash recovery scheme for a memory-resident database system
2.Reimplementing the cedar file system using logging and group commit.


Q3. What was wrong with the other approaches or solutions?
A:
1. They must use an auxiliary structure to speed up writes and crash recovery.
2. The Cedar file system uses the log only for temporary storage; the others use a traditional random-access storage structure on disk.


Q4. What is the authors' approach or solution?
A: The authors' approach is to design Sprite LFS.
Sprite LFS uses an inode map to maintain the current location of each inode, and a checkpoint region to address all the inode map blocks.
Sprite LFS divides the disk into fixed-size segments (512 KB) and uses segment cleaning policies to maintain free space.
It uses checkpoints and roll-forward to accomplish crash recovery.
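The checkpoint-plus-roll-forward recovery mentioned above can be sketched as follows; the record format is an illustrative assumption, not the Sprite LFS on-disk layout:

```python
# Sketch of checkpoint-plus-roll-forward recovery: restore state from the
# last checkpoint, then replay only the log records written after it,
# instead of scanning the whole disk.

def recover(checkpoint, log):
    """checkpoint: (position, state dict); log: list of (key, value)."""
    pos, state = checkpoint
    state = dict(state)
    for key, value in log[pos:]:     # roll forward over the log tail only
        state[key] = value
    return state

cp = (2, {"f1": "addr0", "f2": "addr1"})          # taken after 2 records
tail = [("f1", "old"), ("f2", "old"),             # already in checkpoint
        ("f1", "addr7"), ("f3", "addr9")]         # written after it
print(recover(cp, tail))   # {'f1': 'addr7', 'f2': 'addr1', 'f3': 'addr9'}
```

Recovery time is proportional to the log written since the last checkpoint, not to disk size, which is why it beats a whole-disk scan.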


Q5. Why is it better than the other approaches or solutions?
A:
A log-structured file system writes much larger amounts of new data to disk per disk I/O and uses most of the disk's bandwidth. Free space management is done by dividing the disk into fixed-size segments.
The lowest segment-cleaning overhead is achieved with the cost-benefit policy, and crash recovery is faster than in existing file systems.



Q6. How does it perform?
A:
Based on micro-benchmarks, Sprite LFS outperforms SunOS by almost ten times in small-file performance.


Q7. Why is this work important?
A: Because log-structured file systems use most of the disk's bandwidth by avoiding costly seeks, and crash recovery does not need to scan the entire disk.



Q8. Can any improvement be done?
A: The authors' assumptions are not true for all workloads:
if the user uses flash memory as storage, these seek limitations drop to almost zero.

 

 

(6)

Q1.What is the problem the authors are trying to solve?
A: Over the last decade CPU speed has grown extremely fast, while disk speed has grown only slowly. Read efficiency can be raised with file prediction, but writing large numbers of small files spends most of its time seeking; the authors set out to improve small-file write performance.

Q2.What other approaches or solutions existed at the time that this work was done?
A: The Sprite LFS design borrows ideas from many different storage management systems.
1.FINLAYSON, R. S. AND CHERITON, D. R. Log files: An extended file service
exploiting write-once storage.
2.REED, D. AND SVOBODOVA, L. SWALLOW: A distributed data storage system for a local
network.

Q3.What was wrong with the other approaches or solutions?
A: File systems of the 1990s had two widespread problems:
1. File information is spread around the disk, causing too many small accesses.
2. Current file systems tend to write synchronously.

Q4.What is the authors' approach or solution?
A: The authors' approach is to design the log-structured file system, and they constructed a prototype called Sprite LFS.

Q5.Why is it better than the other approaches or solutions?
A: A log-structured file system converts the many small files accessed randomly in a traditional file system into large sequential transfers, utilizing close to 100% of the disk bandwidth.


Q6.How does it perform?
A: A log-structured file system buffers a sequence of changes in the file cache and writes them all at once, including file data, inodes, etc.


Q7.Why is this work important?
A: Office and engineering applications mostly access files of only a few kilobytes, while the disk is extremely slow compared with the CPU; only by improving disk access performance can the CPU's high-speed processing power be fully exploited.

Q8.Can any improvement be done?
A:No.

 

 

(7)

Q1. What is the problem the authors are trying to solve?
A: As CPUs get faster and faster, a cache must be used to compensate for slow devices that cannot deliver data to the CPU in time. Whether predictions are correct determines whether the cache helps the system, so a better prediction method can raise system performance. The authors propose what is currently the best prediction method.

Q2. What other approaches or solutions existed at the time that this work was
done?
Last-Successor takes the file accessed last and predicts it as the next file to be accessed.
FMOC decides the next file to access according to a score for correct past predictions.
Some methods use probability;
some analyze file structure;
some use specially designed compilers.


Q3. What was wrong with the other approaches or solutions?
Analyzing file structure, or turning to the compiler, cannot be used flexibly and dynamically; probability-based methods are better in this respect, but they still mispredict quite often, because they do not know "why" a prediction is being made: a file is ranked up simply because its probability is high.

Q4. What is the authors' approach or solution?
Each file records the name of the program that accesses it, and records which file that program accesses next. Multiple <program name, successor-list> pairs are kept in the file's metadata, so when a program reads the first file for the second time, the second file will be predicted accurately.

Q5. Why is it better than the other approaches or solutions?
Wrong predictions have a price. Compared with probability-based prediction, PLS's misprediction rate is 15-22% lower than LS's, and this lower misprediction rate makes PLS better than the other methods.

Q6. How does it perform?
They trace open and execve events and score PLS and LS the same way: +1 for a correct prediction, +0 for a wrong one, normalizing the final score. The traces cover 10 to 13 months.

Q7. Why is this work important?
As the question suggests: because the disk is mechanical, its speed is far below the CPU's, leaving the whole computer's performance stuck there, which is a real pity. So an accurate prediction method that raises the cache's effectiveness is very important to raising overall speed.

Q8. Can any improvement be done?
Our operating systems teacher said no one can be a prophet and predict with complete accuracy which file will be used next. If we had enormous memory, none of these problems would exist; but with enormous memory comes the cache problem between memory and the CPU...

 

 

(8)

 Q1.What is the problem the authors are trying to solve?
A:
The authors are concerned with the problem that the time consumed by disk writes is quite inefficient.
Disk read time matters less than write time because memory sizes keep increasing (most data is cached in main memory), so the chance of fetching data from disk keeps decreasing.
The authors emphasize that the main factor dominating disk traffic is the way the disk writes data.
Writing data to disk takes too much effort: the system has to seek to an available location, write the file and directory attributes, and then write the file data.
Their approach not only improves disk write time; it also helps recover the file system after a crash.
In the traditional approach, the system has to seek out blank (available) space to store new data; the authors consider this seek time a waste and try to save it when writing new information.
A current UNIX system scans the whole disk to collect the information needed for the recovery task; with the mechanism of the log-structured file system, we only have to check the tail of the log to find the data that really needs to be recovered.

Q2.What other approaches or solutions existed at the time that this work was done?
A:
There are two similar pieces of research that try to solve the same issues as the authors:
a. Reimplementing the Cedar file system using logging and group commit.
b. Decorum file system architectural overview.
These two approaches keep the log only in temporary storage; unlike them, the authors keep the entire log on disk permanently.

Q3.What was wrong with the other approaches or solutions?
A:
A few solutions tried to speed up disk transfer bandwidth, such as disk arrays and parallel-head disks, but the real problem is the structure of the file system.
Those approaches do help speed up data transfer between the disk and the processor, but if the file system could be improved as well, even more time could be saved.

Q4.What is the authors' approach or solution?
A:
The authors aim at two main targets:
a. Improve the efficiency of writing data into storage (the hard disk).
b. Fast recovery when the system crashes.

For target a, the authors write data into the disk in a more efficient way.
They want to eliminate the pattern in which the system has to access the hard disk separately for each of an application's several small writes.
The method is like a cargo jet taking the whole freight to its destination, rather than tiny airplanes making several flights: the data is written into a buffer first and then carried to the disk all at once.
The authors must find a way to track the disk's free space, since they need more contiguous space than other approaches do.
The log system traces activity on the disk all the time, determining each segment's utilization and logging it.
When utilization reaches a predefined threshold, a process called the 'segment cleaner' runs. The threshold value affects system performance, and the authors simulate different thresholds to find the relationship among these factors.
The cleaner processes segments by dropping the data labeled as deleted or rewritten and moving the live data to another segment.
Keeping live data in neighboring segments improves read efficiency as well. (The authors say the idea came from the 'garbage collection' concept in programming languages.)

For target b, the authors modify the recovery method used by the current UNIX system.
A UNIX system has to read all the metadata on disk every time it crashes, because it does not know the last change made before the crash.
The authors keep the relevant information in the log for fast recovery.
By checking the log (the tail of the log), the system can figure out the situation and proceed with the recovery task.
This approach decreases the time spent recovering the file system (especially for large storage systems).
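Which segments the cleaner should process is decided in the paper by a cost-benefit policy; the formula below is the paper's, while the surrounding code and segment records are an illustrative sketch:

```python
# Sketch of the cost-benefit cleaning policy: a segment is worth cleaning
# when the free space it yields (1 - u), weighted by the age of its data,
# outweighs the cost of reading it and rewriting the live part (1 + u).

def cost_benefit(utilization, age):
    """Cost-benefit score from the LFS paper: (1 - u) * age / (1 + u)."""
    return (1.0 - utilization) * age / (1.0 + utilization)

def pick_victim(segments):
    """Choose the segment with the highest cost-benefit score."""
    return max(segments, key=lambda s: cost_benefit(s["u"], s["age"]))

segs = [{"id": 0, "u": 0.9, "age": 100},   # hot, mostly full
        {"id": 1, "u": 0.5, "age": 10},    # recently written
        {"id": 2, "u": 0.7, "age": 500}]   # cold, moderately full
print(pick_victim(segs)["id"])   # 2 -- cold data gets cleaned even at higher u
```

The age term is what makes the policy treat hot and cold segments differently: cold segments are worth cleaning at a higher utilization than hot ones.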

Q5.Why is it better than the other approaches or solutions?
A:
This approach really point out the key issue that the current file system (or said 1990s actually) meet.
The system performance will depends on the hard disk, especially the disk write time.
Other approach try to solve the problem about how to pre-fetch data to disk and this only improve access time of disk read.
Authors described that disk write request will cost much more than disk read.
If the time that disk write could decrease and it will raise system performance better than other approach.

Q6.How does it perform?
A:
The measurement results show that the file system Sprite LFS outperforms a UNIX system (SunOS in the comparison) in many ways.
For large numbers of small files (10000 files of 1 KB each), they create 10000 files in the file system, read the files back, and finally delete them.
Sprite LFS is almost 7 times faster than SunOS at creating the new files,
1.4 times faster at reading the files back,
and 8 times faster at deleting them.

Q7.Why is this work important?
A:
This approach sacrifices some system resources (cost) to solve the issue, but it really makes its contribution by saving the time the old file system wasted.
According to this paper, only 5% of the disk transfer bandwidth was being utilized; this approach raises that value toward 100%.
Of course, the approach pays costs such as segment cleaning to make larger free disk extents available for writing large data.
I think this paper's value is that it raises file system efficiency dramatically.

Q8.Can any improvement be done?
A:
It is great thinking and a great approach, and it takes many ideas from other application systems (like roll-forward from the database "redo" mechanism).
The approach improves two aspects of what current file systems do; I cannot come up with any further improvement.

 

 

(9)

Q1. What is the problem the authors are trying to solve?
A: As CPUs get faster and faster, the performance bottleneck is still stuck at the disk. The read side can be improved by enlarging main memory and raising cache hit rates, so the authors focus on solving the write-performance problem.

Q2. What other approaches or solutions existed at the time that this work was
done?
On the hardware side:
increasing transfer bandwidth, using disk arrays, using parallel-head disks,
but these do not truly improve access time.
Enlarging main memory.
Using a write buffer to collect enough changes to write at once, although data is lost after a crash.
The Unix Fast File System.

Q3. What was wrong with the other approaches or solutions?
On the hardware side, disks rely on mechanical motion, so one cannot count on them improving. Enlarging memory only solves the read problem; write speed stays just as slow. Unix FFS lays a file out physically scattered on disk, so it performs many I/O operations, and disk seeks are very slow; on top of that, it must wait synchronously for these operations to complete.

Q4. What is the authors' approach or solution?

1. Use a write buffer to collect enough changes and write them to the disk sequentially in one pass, ideally using 100% of the bandwidth.
2. Apart from having no bitmap and no free list, LFS uses mechanisms much like FFS's.
3. LFS uses an inode map to manage each inode's location; the map is written to the log and cached in main memory.
4. One segment is written at a time; fragments holding live data are tolerated at first, and otherwise are copied and compacted.
5. The cleaning mechanism: read segments into memory > determine whether data is live > write the live data into a clean segment > mark this segment as clean.
6. Checkpoints are recorded about every 30 seconds; after a system crash, only the checkpoint needs to be consulted.
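The inode map in point 3 can be sketched as a tiny address table; the class and addresses are illustrative:

```python
# Sketch of the inode map: because an inode moves every time it is
# rewritten to the log, a small map records each inode's current disk
# address; the map itself is cached in memory and written to the log.

class InodeMap:
    def __init__(self):
        self.addr = {}                 # inode number -> latest log address

    def update(self, inode_no, log_address):
        self.addr[inode_no] = log_address   # newest copy wins

    def lookup(self, inode_no):
        return self.addr[inode_no]

imap = InodeMap()
imap.update(7, 1024)     # inode 7 first written at log address 1024
imap.update(7, 4096)     # rewritten later; the map tracks the new copy
print(imap.lookup(7))    # 4096
```

This extra level of indirection is what lets inodes float in the log instead of living at fixed disk positions as in FFS.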

Q5. Why is it better than the other approaches or solutions?
Read performance is roughly the same as the others', but writes are much faster than the other methods, and repairing data after a crash is also faster than the traditional approach.


Q6. How does it perform?
1. Use the same machine, a Sun-4/260, with two different operating systems, Sprite (LFS) and SunOS (FFS), with LFS doing no cleaning (its best case): create, read, and delete 10000 1 KB files.
2. Do sequential writes/reads, random writes/reads, and sequential reads of randomly written files.

Q7. Why is this work important?
The asymmetric development of CPU and disk performance creates a performance bottleneck. LFS speeds up writes; unlike traditional FFS, which cannot exploit the benefits of a faster CPU, LFS shows even faster write performance on faster CPUs.

Q8. Can any improvement be done?
Because the design concentrates only on improving writes, possible read problems were overlooked, and serious fragmentation can occur. Furthermore, the amount of free disk space remaining affects performance, and frequent runs of the cleaning mechanism also impose overhead.