Summary of paper on RAID survey
Abstract
- disk arrays
- to use parallelism b/w disks to improve I/O performance
- intro to disk technologies
- need for disk arrays
- performance
- reliability
- techniques used in disk arrays
- striping
- improves performance
- redundancy
- improves reliability
- description of disk array architectures (RAID 0-6)
- refining RAID levels to improve performance and designing algos to maintain consistency
Introduction
- RAID
- Redundant Array of Inexpensive (or Independent) Disks
- driving force?
- performance and density of semiconductors is improving
- faster microprocessors, larger primary memory
- need larger, higher-performance secondary storage
- Amdahl's law (see the sketch after this list)
- large improvements in microprocessors yield only small overall speedups unless secondary storage improves too
- disk access times improve < 10%/year, microprocessor performance > 50%/year
- disk transfer rates ~ 20%/year
- performance gap is widening
- data-intensive applns building up daily
- image processing
- video, hypertext, multimedia
- high performance n/w based storage systems are required
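A quick back-of-the-envelope sketch of the Amdahl's law argument above; the 10% I/O fraction and the speedup values are illustrative assumptions, not numbers from the paper:

```python
def amdahl_speedup(cpu_speedup: float, io_fraction: float) -> float:
    """Overall speedup when only the CPU-bound part (1 - io_fraction)
    gets faster and the I/O-bound part stays the same speed."""
    cpu_fraction = 1.0 - io_fraction
    return 1.0 / (io_fraction + cpu_fraction / cpu_speedup)

# Hypothetical workload spending 10% of its time in I/O:
# even a 10x faster CPU gives well under 10x overall.
print(amdahl_speedup(cpu_speedup=10, io_fraction=0.10))   # ~5.3x
print(amdahl_speedup(cpu_speedup=100, io_fraction=0.10))  # ~9.2x, capped near 10x
```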
Disk Arrays
- stripe data across multiple disks
- access in parallel
- high data transfer rate on large data access
- high I/O rate on small data access
- uniform load balancing
- Problem?
- disk failures
- n disks => n times more likely to fail
- Mean Time to Failure (MTTF) of x hrs for a disk means MTTF of x/n hrs for an n-disk array (see the sketch after this list)
- Solution?
- redundancy
- error correcting codes
- Problems?
- write operation => write to all redundant parts too
- so, write performance decreases
- maintaining consistency in concurrent operations and system crashes
- Various striping and redundancy techniques exist
- each has some trade-offs among reliability, performance, and cost
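A minimal sketch of the MTTF scaling noted above (independent disk failures assumed; the numbers are illustrative):

```python
def array_mttf_no_redundancy(disk_mttf_hours: float, n_disks: int) -> float:
    """With independent failures and no redundancy, the array loses data as
    soon as any one disk fails, so MTTF drops by a factor of n."""
    return disk_mttf_hours / n_disks

# e.g. 200,000-hour disks in a 100-disk array: ~2,000 hours (~83 days)
print(array_mttf_no_redundancy(200_000, 100))
```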
Background
Disk Terminology
- platter
- coated with magnetic medium
- rotates at constant angular velocity
- disk arms with magnetic r/w heads are moved radially across the platter's surfaces by an actuator
- once positioned, data is read/written in small arcs called sectors as the platter rotates relative to the heads
- although all heads move collectively, only one can read or write at a time
- a complete circular swath is called a track
- vertical collection of tracks at the same radial position is called a cylinder
- Disk service time consists of
- seek time
- to move head to correct radial position, ranges from 1 to 30ms
- rotational latency
- to bring desired sector under head, ranges from 8 to 28ms
- data transfer time
- depends on rate at which data can be transferred from/to platter’s surface
- depends upon rate of rotation, density of magnetic media, radial distance of head from center
- zone bit recording can be used to store more data on outside tracks
- ranges from 1 to 5 MB/sec
- seek time + rotational latency = head-positioning time
- Total time depends upon the layout of data as well
- if it is laid out sequentially within a single cylinder, vs
- if it is laid out randomly in, say, 8KB blocks
- so, I/O intensive applications are categorized as
- high data rate - minimal head positioning via large, seq accesses
- eg scientific programs with large arrays
- high I/O rate - lots of head positioning via small, more random accesses
- eg transaction processing programs
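A small sketch of the service-time breakdown above, comparing one large sequential access against many random 8KB accesses; the timing parameters are illustrative values within the ranges quoted above:

```python
def service_time_ms(bytes_to_transfer: int,
                    seek_ms: float = 10.0,                # head positioning: seek
                    rotational_latency_ms: float = 6.0,   # plus rotation
                    transfer_rate_mb_s: float = 3.0) -> float:
    """Service time = seek + rotational latency + data transfer."""
    transfer_ms = bytes_to_transfer / (transfer_rate_mb_s * 1e6) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

one_mb = 1_000_000
# one large sequential 1 MB access: positioning is paid once
print(service_time_ms(one_mb))           # ~349 ms
# the same 1 MB as 125 random 8 KB accesses: positioning is paid every time
print(125 * service_time_ms(8_000))      # ~2333 ms
```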
Data paths
- information stored on platter in form of magnetic fields
- it is sensed and digitized into pulses
- pulses are decoded to separate data bits from timing-related flux
- bits aligned into bytes, error-correcting codes applied, and extracted data delivered to host as data blocks over a peripheral bus interface such as SCSI (Small Computer System Interface)
- these steps are performed by on-disk controllers
- the peripheral bus interface also presents a mapping from logical block number to actual cylinder, track, and sector (see the sketch after this list)
- this allows bad areas of the disk to be avoided by remapping bad logical blocks
- topologies and devices on path b/w disk and host vary depending on size and type of I/O system
- channel path consists of
- channel
- storage director
- head of string (collection of disks that share the same pathway)
- channel processor is aka I/O controller or Host Bus Adapter (HBA)
- from HBA, data is transferred via direct memory access over a system bus to host OS buffers
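A toy sketch of the logical-block-to-physical-geometry mapping mentioned above; the geometry numbers and the simple linear layout (no zone bit recording, no bad-block remapping) are assumptions for illustration only:

```python
def lbn_to_chs(lbn: int, heads: int = 8, sectors_per_track: int = 64):
    """Map a logical block number to (cylinder, head, sector) for a
    hypothetical disk with a uniform geometry."""
    blocks_per_cylinder = heads * sectors_per_track
    cylinder = lbn // blocks_per_cylinder
    head = (lbn % blocks_per_cylinder) // sectors_per_track
    sector = lbn % sectors_per_track
    return cylinder, head, sector

print(lbn_to_chs(100_000))  # (195, 2, 32) for this assumed geometry
```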
Technology trend
- recording density is increasing
- allows capacity to stay constant or increase even as physical size decreases
- with increase in rotational speed, transfer rate also increases
Disk Array Basics
Striping and Redundancy
- Striping
- distributes data over multiple disks to make them appear as single fast, large disk
- multiple independent requests can be serviced in parallel by separate disks => queuing time decreases
- a single multi-block request can be serviced by multiple disks in coordination => effective transfer rate increases
- the more disks, the better the performance, but a larger # of disks decreases reliability
- Redundant Disk array organizations can be distinguished on basis of
- granularity of data interleaving
- fine grained
- interleave data in small units
- all I/O reqs access all disks in disk array
- high data transfer rates but
- only 1 logical I/O request can be serviced at a time
- all disks must waste time positioning for every request
- coarse grained
- interleave in large units
- small requests use fewer disks so multiple requests can be served in parallel
- large requests use all disks so transfer rate is high
- how redundant information is computed and distributed
- How to compute redundant information
- parity
- hamming code
- reed-solomon code
- how to distribute redundant information
- 2 patterns
- concentrate redundant information on small # of disks
- distribute the information across all disks
- desirable
- load-balancing problems avoided
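A minimal sketch of parity as the redundancy computation listed above: the parity unit is the bytewise XOR of the data units in a stripe, and any single lost unit can be rebuilt by XOR-ing the survivors (the 4-disk stripe below is an arbitrary example):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # data units on 4 disks
parity = xor_blocks(stripe)                      # stored on a 5th unit

# simulate losing disk 2 and rebuilding it from the survivors + parity
survivors = [u for i, u in enumerate(stripe) if i != 2] + [parity]
assert xor_blocks(survivors) == stripe[2]
```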
Basic RAID Organizations

Nonredundant - RAID Level 0
- lowest cost
- no redundancy
- best write performance
- read performance is not the best
- a redundant array can do better by scheduling reads on the disk with the shortest expected seek and rotational delays
- single disk failure => data loss
- used in supercomputing environments where performance and capacity are preferred over reliability
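A small sketch of how RAID level 0 striping maps a logical block to a (disk, offset) pair; the round-robin layout by striping unit is the standard idea, and the parameter values are arbitrary:

```python
def raid0_locate(logical_block: int, n_disks: int, blocks_per_stripe_unit: int):
    """Return (disk index, block offset on that disk) for a logical block
    striped round-robin across n_disks."""
    stripe_unit = logical_block // blocks_per_stripe_unit
    disk = stripe_unit % n_disks
    stripe = stripe_unit // n_disks
    offset = stripe * blocks_per_stripe_unit + logical_block % blocks_per_stripe_unit
    return disk, offset

# 4 disks, 16-block striping unit: consecutive units land on consecutive disks
print(raid0_locate(0, 4, 16))    # (0, 0)
print(raid0_locate(16, 4, 16))   # (1, 0)
print(raid0_locate(70, 4, 16))   # (0, 22)
```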
Mirrored - Level 1
- mirroring or shadowing
- use 2x #disks
- writes go to both sets
- reads can be done from the copy with shorter queuing, seek, and rotational delays
- usage in database applications
Memory-Style ECC - Level 2
- much less cost than mirroring
- uses Hamming codes
- number of redundant disks is proportional to the log of the total number of disks
- when a disk fails, several of the parity components will have inconsistent values; the failed disk is the one held in common by every incorrect subset
- lost information is recovered by reading other components in subset including parity component
Bit interleaved parity - level 3
- single parity disk to recover lost information
- data is conceptually interleaved bit-wise over data disks
- and there is single parity disk
- a read request accesses all data disks
- write request accesses all data + parity disks
- so only 1 request can be serviced at a time
- parity disk contain no data so cannot participate in reads
- used for high bandwidth but not high I/O rates
Block interleaved parity - Level 4
- data is interleaved in blocks of arbitrary size
- size is called striping unit
- read requests smaller than the striping unit access only 1 data disk
- write requests must update both data and parity blocks
- if a write spans all disks, the new parity is just the XOR of all the new data
- a small write (on less than all disks) requires 4 disk I/Os
- read old data
- read old parity
- write new data
- write new parity
- parity disk can be a bottleneck since it is used in every write request
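A sketch of the small-write (read-modify-write) parity update described above: new parity = old parity XOR old data XOR new data, which is what accounts for the 4 disk I/Os. The block contents are arbitrary examples:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """The 4 I/Os of a block-interleaved small write:
    (1) read old data, (2) read old parity,
    then compute, (3) write new data, (4) write new parity."""
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity

# consistency check: rebuilding parity from scratch gives the same answer
d = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]          # 3 data blocks
p = xor(xor(d[0], d[1]), d[2])                        # full-stripe parity
new_d1, new_p = small_write(d[1], p, b"\x7f\x7f")     # overwrite block 1
assert new_p == xor(xor(d[0], new_d1), d[2])
```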
Block interleaved distributed parity - level 5
- distributes parity uniformly over all disks
- all disks participate in read
- has the best small read, large read, and large write performance
- small write reqs are inefficient due to the 4 disk I/Os reqd
- how parity is distributed affects performance
- the best studied placement is the left-symmetric layout (see the sketch below)
- universally preferred over level 4
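A sketch of the left-symmetric parity placement mentioned above, in its usual formulation: data block n lives on disk n mod N, and each stripe's parity rotates one disk to the left per stripe. Treat the exact formulas as a common reconstruction of the layout rather than a quote from the paper:

```python
def left_symmetric_layout(n_disks: int, n_stripes: int):
    """Return a table layout[stripe][disk] with 'P' for parity and the
    logical data block number otherwise."""
    layout = [[None] * n_disks for _ in range(n_stripes)]
    for stripe in range(n_stripes):
        layout[stripe][(n_disks - 1 - stripe) % n_disks] = "P"   # rotating parity
    block = 0
    while block < n_stripes * (n_disks - 1):
        stripe = block // (n_disks - 1)
        layout[stripe][block % n_disks] = block                   # data on disk n mod N
        block += 1
    return layout

for row in left_symmetric_layout(5, 5):
    print(row)
# [0, 1, 2, 3, 'P']
# [5, 6, 7, 'P', 4]
# [10, 11, 'P', 8, 9] ...
```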
P+Q redundancy - level 6
- uses Reed-Solomon codes to protect against 2 disk failures
- uses 2 redundant disks
- similar to block interleaved structurally
- small writes require 6 disk I/Os
Performance and Cost comparisons
- metrics to evaluate disk arrays
- reliability
- performance
- #io per second
- cost
- generally aggregate throughput is more important than response time on individual requests
- normalize metrics, i.e. instead of #io/second, take #io/second/dollar
- compare systems with equivalent file capacities (amount of information that can be stored apart from redundant information storage)

Reliability
- if only independent disk failures are considered, MTTF for RAID level 5 is MTTF_disk² / (N × (G − 1) × MTTR)
- MTTF and MTTR are mean time to failure and repair for one disk
- N is number of disks
- and G is parity group size
- 100 disks, each with MTTF = 200k hrs, and MTTR = 1hr, with G = 16 => MTTF = 3000yrs
- for a disk array with 2 redundant disks per parity group, such as P+Q, MTTF is MTTF_disk³ / (N × (G − 1) × (G − 2) × MTTR²)
- using the same values => ~38 million yrs
- factors like system crashes, uncorrectable bit errors, correlated disk failures can affect reliability a lot
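A quick sketch of the independent-failure MTTF estimate above for RAID level 5, using the example numbers quoted in these notes (the P+Q case is the analogous cubed form and lands in the tens of millions of years):

```python
HOURS_PER_YEAR = 24 * 365

def raid5_mttf_hours(disk_mttf: float, n_disks: int, group_size: int, mttr: float) -> float:
    """Independent-failure model: data is lost only if a second disk in the
    same parity group fails while the first failure is being repaired."""
    return disk_mttf ** 2 / (n_disks * (group_size - 1) * mttr)

# 100 disks, 200,000-hour MTTF, 1-hour MTTR, parity groups of 16
print(raid5_mttf_hours(200_000, 100, 16, 1) / HOURS_PER_YEAR)  # ~3,000 years
```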
System Crashes and Parity Inconsistency
- System crash = power failure or s/w crash or h/w breakdown etc
- can lead to inconsistent states
- parity updated, data not and vice versa
- use redundant h/w? even then, 100% reliability can't be achieved
- bit-interleaved and block-interleaved both have this prob, but the latter is of more concern
- because an interrupted write can corrupt the parity protecting data that wasn't being written
- so a system crash is equivalent to a disk failure
- also, system crashes occur more frequently than disk failures, and for P+Q arrays a crash is equivalent to 2 disk failures
- solution?
- logging of data to non-volatile storage before writing
Uncorrectable Bit Errors
- failures to correctly read or write small bits of data
- generated because data is incorrectly written or gradually damaged as magnetic media ages
- when reconstructing a failed disk by reading data from the non-failed disks, a bit error rate of 1 in 10^14 => success prob ≈ 99.2%, i.e. ~0.8% of reconstructions fail due to an uncorrectable bit error
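A sketch of where a figure like 99.2% comes from: the chance of reading b bits with no uncorrectable error is (1 − BER)^b. The total-bits-read value below is a hypothetical reconstruction size chosen to be in that ballpark, not a number from the paper:

```python
def reconstruction_success_prob(bit_error_rate: float, bits_read: int) -> float:
    """Probability that every bit read during reconstruction is correct,
    assuming independent bit errors."""
    return (1.0 - bit_error_rate) ** bits_read

# e.g. reading ~10^12 bits (~125 GB) from the surviving disks
print(reconstruction_success_prob(1e-14, 10**12))  # ~0.99
```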
Correlated Disk Failures
- environmental and manufacturing factors can frequently cause correlated disk failures
- earthquakes, power surges, power failures, etc.
- three modes of failure for RAID level 5
- double disk failure
- system crash followed by a disk failure
- disk failure followed by an uncorrectable bit error during reconstruction.
Failure Characteristics for Level 5 and P+Q RAID types


Implementation Considerations
- Disk array state consists of the data and parity information, and apart from that
- it also contains information like which disks have failed, how much of a failed disk has been reconstructed, and which sectors are currently being updated (this information is reqd when the system crashes)
- this information is known as metastate information
Avoiding Stale data
- need to maintain information about validity of each sector of data and parity in disk array
- when a disk fails, mark its logical sectors invalid (prevents reading corrupted data)
- when an invalid sector is reconstructed, mark it valid (ensures new writes update the information on this sector)
- maintain valid/invalid information as a bit vector
- updating valid/invalid metastate information to stable storage does not present significant performance overhead
Regenerating Parity after System Crash
- information about consistent/inconsistent parity sectors is required to be logged to a stable storage to avoid regeneration of all parity sectors in case of a failure
- before performing write, mark parity sectors as inconsistent
- and after bringing the system up from a crash, regenerate all parity sectors marked inconsistent
- and periodically re-mark sectors that are no longer being updated as consistent
- performance prob in S/W RAID systems since they do not have access to fast non volatile storage
- methods to implement
- mark the parity sector inconsistent before every write
- keep a MRU pool of inconsistent parity sectors in memory to avoid remarking of already marked inconsistent sectors
- this works if there is a large locality of reference
- in transaction processing, with a large # of random write reqs, use a group commit mechanism: a large # of parity sectors can be marked inconsistent with a single access to non-volatile storage
- improves throughput but results in higher latencies for some writes
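A small sketch of the marking scheme above: keep an MRU pool of already-marked parity sectors so repeated writes to hot sectors don't re-log, and batch new markings (group commit) into one stable-storage access. The class and method names are made up for illustration:

```python
from collections import OrderedDict

class ParityInconsistencyLog:
    """Track which parity sectors may be inconsistent after a crash."""
    def __init__(self, mru_capacity: int = 1024):
        self.mru = OrderedDict()        # recently marked-inconsistent sectors
        self.mru_capacity = mru_capacity
        self.stable_writes = 0          # accesses to non-volatile storage

    def mark_inconsistent(self, parity_sectors):
        """Call before performing the writes that touch these parity sectors."""
        to_log = [s for s in parity_sectors if s not in self.mru]
        if to_log:
            self.stable_writes += 1     # one group-committed log append
        for s in parity_sectors:
            self.mru[s] = True
            self.mru.move_to_end(s)
        while len(self.mru) > self.mru_capacity:
            self.mru.popitem(last=False)

log = ParityInconsistencyLog()
log.mark_inconsistent([7, 8, 9])   # one stable-storage access for the batch
log.mark_inconsistent([8])         # already in the MRU pool: no extra access
print(log.stable_writes)           # 1
```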
Operating with failed disks
- system crash in block-interleaved is similar to disk failure (it can result in loss of parity info)
- some logging is reqd on every write to prevent loss of information in case of system crash
- 2 ways to do so
- demand reconstruction
- requires stand-by spare disks
- an access to a parity stripe with an invalid sector triggers reconstruction of the appropriate data immediately onto a spare disk
- write ops then never deal with invalid sectors created by disk failures
- parity sparing
- no stand-by spares reqd
- requires additional metastate information
- before servicing a write request to a parity stripe with an invalid sector, the sector is reconstructed and relocated to overwrite its corresponding parity sector
- and the sector is marked as relocated
- when the failed disk is replaced
- relocated sector is copied to spare disk
- parity is regenerated
- sector is no longer marked as relocated
Advanced Topics
Improving Small Write Performance on RAID level 5
- small writes require 4 disk I/Os
- increases response time by factor of 2, and decreases throughput by factor of 4
- and, mirrored disk arrays just use 2 disk I/Os and hence throughput decreases by a factor of 2 only
- so applns like transaction processing can't work well with RAID level 5 as it is
- 3 techniques to improve small write performance
- buffering and caching
- floating parity
- parity logging
Buffering and caching
- write buffering/async writes
- acknowledge the user's writes before actually writing to disk
- store in buffer
- response time improves
- throughput can be improved as
- overwrites to same data can be done w/o actually going to disk
- disk scheduler gets a larger view of I/O requests and hence can schedule effectively
- under high load, write buffer fills earlier than it empties and response time will still be 4 times worse than RAID 0
- can also group sequential writes together => can turn small writes into larger writes
- Read Caching
- improve resp time and throughput while reading data
- assumes temporal locality
- if old data is cached, write need only 3 disk I/Os now
- if recently written parity is also cached, only 2 disk I/Os now
- parity is computed over logically consecutive sectors, so caching it exploits both temporal and spatial locality
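A tiny sketch of the I/O counting above for a RAID level 5 small write, as a function of what the controller already has cached (purely illustrative bookkeeping):

```python
def small_write_disk_ios(old_data_cached: bool, old_parity_cached: bool) -> int:
    """Disk I/Os for a block-interleaved-parity small write:
    read old data + read old parity + write new data + write new parity,
    minus any reads that hit in the cache."""
    ios = 4
    if old_data_cached:
        ios -= 1
    if old_parity_cached:
        ios -= 1
    return ios

print(small_write_disk_ios(False, False))  # 4
print(small_write_disk_ios(True, False))   # 3
print(small_write_disk_ios(True, True))    # 2
```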
Parity Logging
- delay read of old parity and write of new parity
- instead of updating parity immediately, write the difference b/w old and new parity, known as the parity update image, to a log
- store the parity update image temporarily in non-volatile memory
- when this memory fills up, write the update images to a log region on disk
- when the log fills up, read the old parity and the logged update images into memory, XOR them, and write the resulting parity back to disk
- more data is transferred to and from disk but larger accesses are efficient than smaller ones
- small write overhead reduces from 4 to little more than 2 disk I/Os
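A minimal sketch of the parity-logging flow above: each small write appends an update image (old data XOR new data) to a log instead of touching parity, and the log is later folded into the on-disk parity in one large pass. The data structures and names are illustrative:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

parity = {0: b"\x00\x00"}    # on-disk parity, one stripe shown
log = []                      # parity update images (NVRAM, then a log region on disk)

def small_write(stripe: int, old_data: bytes, new_data: bytes) -> None:
    """Log the update image instead of reading/writing parity now."""
    log.append((stripe, xor(old_data, new_data)))

def apply_log() -> None:
    """When the log fills: fold every update image into the old parity
    using large, efficient accesses."""
    for stripe, image in log:
        parity[stripe] = xor(parity[stripe], image)
    log.clear()

small_write(0, old_data=b"\x11\x11", new_data=b"\x22\x22")
apply_log()
print(parity[0])   # b'\x33\x33' == old parity XOR (old data XOR new data)
```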
Declustered Parity
- disk failures can cause performance degradations in standard RAID level 5 disk arrays
- a workload of small reads will double the effective load at the non-failed disks due to the extra accesses needed to reconstruct data for reads to the failed disk
- striping data across several smaller RAIDs reduces the average increase in load compared to one RAID with a single large parity group, but the RAID containing the failed disk still sees a ~100% increase in load in the worst case
- prob is that although inter-RAID striping distributes load uniformly when no disk has failed, it distributes the increased load from a failed disk nonuniformly
- the small set of disks in the same parity group as the failed disk bears the entire weight of the increased load
- declustered-parity RAID solves this prob by distributing load uniformly over all disks

- Disadvantages?
- less reliable
- any 2 disks failures will result in data loss since each pair of disks has a parity gp in common
- complex parity gps could disrupt sequential placement of data across the disks
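A back-of-the-envelope sketch of the load argument above. With an ordinary parity group of size G, each read to the failed disk turns into G − 1 reads on the other disks of that group (~100% extra load on them); declustering spreads those reconstruction reads over all C − 1 surviving disks of a larger array, so the per-disk increase drops to roughly (G − 1)/(C − 1). This ratio is the standard declustering argument, stated here from general knowledge rather than quoted from the paper:

```python
def extra_load_fraction(group_size: int, array_disks: int) -> float:
    """Approximate fractional load increase on each surviving disk when
    reads to one failed disk are reconstructed on the fly."""
    return (group_size - 1) / (array_disks - 1)

print(extra_load_fraction(group_size=5, array_disks=5))    # 1.0 -> ~100% extra load
print(extra_load_fraction(group_size=5, array_disks=21))   # 0.2 -> ~20% extra load
```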
Exploiting on-line spare disks
- on-line spare disks allow reconstruction of failed disks to start immediately reducing the window of vulnerability during which an additional disk failure would result in data loss
- but they are idle most of time and do not contribute to normal operations
- 2 techniques can be used to exploit them to enhance performance during normal ops
- distributed sparing

- distribute capacity of spare disk across all disks in array
- instead of N data disks and 1 spare disk, you have N+1 data disks, each with 1/(N+1)th spare capacity
- when a disk fails, its blocks are reconstructed on the corresponding spare blocks
- disadvantages?
- reconstructed data must eventually be copied onto a permanent replacement of the failed disk, but the copying can be done leisurely and hence doesn't affect performance much
- reconstructed data is distributed across many disks whereas it formerly lived on a single disk, so the original data placement is disturbed
- parity sparing

- it uses spare capacity to store parity information
- second parity can be used to partition disk array logically into two separate disk arrays
- and it can be used to implement P+Q redundancy
Data Striping in Disk Arrays
- deciding the striping unit size requires balancing 2 conflicting goals
- maximizing amount of data a disk transfers with each logical I/O
- positioning time is wasted work so maximizing amount of useful work done b/w two positionings is beneficial
- utilizing all disks in an I/O
- Idle times are also wasted time
- ensuring this means data is spread widely among more disks and hence each disk transfers less data in one logical I/O
- finding balance between two is tradeoff in deciding data distribution and is workload dependent
- If the concurrency of the workload is known, there is a formula for a striping unit providing 95% of the maximum throughput possible for any particular request distribution (see the sketch at the end of this section)
- striping unit ≈ 1/4 × avg positioning time × data transfer rate × (concurrency − 1) + 1 sector
- avg positioning time = avg seek time + avg rotational delay
- small concurrency gives a smaller striping unit, allowing each access to utilize all disks
- large concurrency gives a larger striping unit so that more independent accesses can be serviced in parallel
- if nothing is known about concurrency
- striping unit ≈ 2/3 × avg positioning time × data transfer rate
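A sketch plugging numbers into the sizing rules above; the coefficients follow the rules as stated in these notes and the disk parameters are illustrative, so treat the exact byte counts as an assumption rather than a result from the paper:

```python
from typing import Optional

SECTOR_BYTES = 512

def striping_unit_bytes(avg_positioning_ms: float,
                        transfer_rate_mb_s: float,
                        concurrency: Optional[int]) -> float:
    """Rule-of-thumb striping unit size in bytes."""
    bytes_per_ms = transfer_rate_mb_s * 1e6 / 1000
    if concurrency is None:
        # nothing known about concurrency: fixed compromise
        return (2 / 3) * avg_positioning_ms * bytes_per_ms
    return 0.25 * avg_positioning_ms * bytes_per_ms * (concurrency - 1) + SECTOR_BYTES

# e.g. 16 ms avg positioning time, 3 MB/s transfer rate
print(striping_unit_bytes(16, 3, concurrency=1))    # 512 bytes: stripe finely
print(striping_unit_bytes(16, 3, concurrency=9))    # ~96 KB: larger units
print(striping_unit_bytes(16, 3, concurrency=None)) # ~32 KB compromise
```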