Summary of paper on RAID survey
Abstract
- disk arrays
- to use parallelism b/w disks to improve I/O performance
- intro to disk technologies
- need for disk arrays
- performance
- reliability
- techniques used in disk arrays
- striping
- improves performance
- redundancy
- improves reliability
- description of disk array architectures (RAID 0-6)
- refining RAID levels to improve performance and designing algos to maintain consistency
Introduction
- RAID
- Redundant Array of Inexpensive (or Independent) Disks
- driving force?
- performance and density of semiconductors is improving
- faster microprocessors, larger primary memory
- need larger, higher-performance secondary storage
- Amdahl's law (see the sketch after this list)
- large improvements in microprocessors yield only small overall speedups unless secondary storage improves too
- disk access times improve < 10%/year, microprocessor performance > 50%/year
- disk transfer rates ~ 20%/year
- performance gap is widening
- data-intensive applns building up daily
- image processing
- video, hypertext, multimedia
- high performance n/w based storage systems are required
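A quick back-of-the-envelope sketch of the Amdahl's law argument above; the 10% I/O fraction and the speedup values are illustrative assumptions, not numbers from the paper:

```python
def amdahl_speedup(cpu_speedup: float, io_fraction: float) -> float:
    """Overall speedup when only the CPU-bound part (1 - io_fraction)
    gets faster and the I/O-bound part stays the same speed."""
    cpu_fraction = 1.0 - io_fraction
    return 1.0 / (io_fraction + cpu_fraction / cpu_speedup)

# Hypothetical workload spending 10% of its time in I/O:
# even a 10x faster CPU gives well under 10x overall.
print(amdahl_speedup(cpu_speedup=10, io_fraction=0.10))   # ~5.3x
print(amdahl_speedup(cpu_speedup=100, io_fraction=0.10))  # ~9.2x, capped near 10x
```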
Disk Arrays
- stripe data across multiple disks
- access in parallel
- high data transfer rate on large data access
- high I/O rate on small data access
- uniform load balancing
- Problem?
- disk failures
- n disks => n times more likely to fail
- Mean Time to Failure (MTTF) of x hrs for a disk means MTTF of x/n hrs for an n-disk array (see the sketch after this list)
- Solution?
- redundancy
- error correcting codes
- Problems?
- write operation => write to all redundant parts too
- so, write performance decreases
- maintaining consistency in concurrent operations and system crashes
- Various striping and redundancy techniques exist
- each has some trade-offs among reliability, performance, and cost
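A minimal sketch of the MTTF scaling noted above (independent disk failures assumed; the numbers are illustrative):

```python
def array_mttf_no_redundancy(disk_mttf_hours: float, n_disks: int) -> float:
    """With independent failures and no redundancy, the array loses data as
    soon as any one disk fails, so MTTF drops by a factor of n."""
    return disk_mttf_hours / n_disks

# e.g. 200,000-hour disks in a 100-disk array: ~2,000 hours (~83 days)
print(array_mttf_no_redundancy(200_000, 100))
```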
Background
Disk Terminology
- platter
- coated with magnetic medium
- rotates at constant angular velocity
- disk arms with magnetic r/w heads are moved radially across the platter's surfaces by an actuator
- once positioned, data is read/written in small arcs called sectors as the platter rotates relative to the heads
- although all heads move collectively, only one can read or write at a time
- a complete circular swath is called a track
- vertical collection of tracks at the same radial position is called a cylinder
- Disk service time consists of
- seek time
- to move head to correct radial position, ranges from 1 to 30ms
- rotational latency
- to bring desired sector under head, ranges from 8 to 28ms
- data transfer time
- depends on rate at which data can be transferred from/to platter’s surface
- depends upon rate of rotation, density of magnetic media, radial distance of head from center
- zone bit recording can be used to store more data on outside tracks
- ranges from 1 to 5 MB/sec
- seek time + rotational latency = head-positioning time
- Total time depends upon the layout of data as well
- if it is laid out sequentially within a single cylinder, vs
- if it is laid out randomly in, say, 8KB blocks
- so, I/O intensive applications are categorized as
- high data rate - minimal head positioning via large, seq accesses
- eg scientific programs with large arrays
- high I/O rate - lots of head positioning via small, more random accesses
- eg transaction processing programs
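A small sketch of the service-time breakdown above, comparing one large sequential access against many random 8KB accesses; the timing parameters are illustrative values within the ranges quoted above:

```python
def service_time_ms(bytes_to_transfer: int,
                    seek_ms: float = 10.0,                # head positioning: seek
                    rotational_latency_ms: float = 6.0,   # plus rotation
                    transfer_rate_mb_s: float = 3.0) -> float:
    """Service time = seek + rotational latency + data transfer."""
    transfer_ms = bytes_to_transfer / (transfer_rate_mb_s * 1e6) * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

one_mb = 1_000_000
# one large sequential 1 MB access: positioning is paid once
print(service_time_ms(one_mb))           # ~349 ms
# the same 1 MB as 125 random 8 KB accesses: positioning is paid every time
print(125 * service_time_ms(8_000))      # ~2333 ms
```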
Data paths
- information stored on platter in form of magnetic fields
- it is sensed and digitized into pulses
- pulses are decoded to separate data bits from timing-related flux
- bits aligned into bytes, error-correcting codes applied, and extracted data delivered to host as data blocks over a peripheral bus interface such as SCSI (Small Computer System Interface)
- these steps are performed by on-disk controllers
- the peripheral bus interface also presents a mapping from logical block number to actual cylinder, track, and sector (see the sketch after this list)
- this allows bad areas of the disk to be avoided by remapping bad logical blocks
- topologies and devices on path b/w disk and host vary depending on size and type of I/O system
- channel path consists of
- channel
- storage director
- head of string (collection of disks that share the same pathway)
- channel processor is aka I/O controller or Host Bus Adapter (HBA)
- from HBA, data is transferred via direct memory access over a system bus to host OS buffers
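A toy sketch of the logical-block-to-physical-geometry mapping mentioned above; the geometry numbers and the simple linear layout (no zone bit recording, no bad-block remapping) are assumptions for illustration only:

```python
def lbn_to_chs(lbn: int, heads: int = 8, sectors_per_track: int = 64):
    """Map a logical block number to (cylinder, head, sector) for a
    hypothetical disk with a uniform geometry."""
    blocks_per_cylinder = heads * sectors_per_track
    cylinder = lbn // blocks_per_cylinder
    head = (lbn % blocks_per_cylinder) // sectors_per_track
    sector = lbn % sectors_per_track
    return cylinder, head, sector

print(lbn_to_chs(100_000))  # (195, 2, 32) for this assumed geometry
```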
Technology trend
- recording density is increasing
- allows capacity to stay constant or increase even as physical size decreases
- with increase in rotational speed, transfer rate also increases
Disk Array Basics
Striping and Redundancy
- Striping
- distributes data over multiple disks to make them appear as single fast, large disk
- multiple independent requests can be serviced in parallel by separate disks => queuing time decreases
- a single multi-block request can be serviced by multiple disks in coordination => effective transfer rate increases
- the more disks, the better the performance, but a larger # of disks decreases reliability
- Redundant Disk array organizations can be distinguished on basis of
- granularity of data interleaving
- fine grained
- interleave data in small units
- all I/O reqs access all disks in disk array
- high data transfer rates but
- only 1 logical I/O request can be serviced at a time
- all disks must waste time positioning for every request
- coarse grained
- interleave in large units
- small requests use fewer disks so multiple requests can be served in parallel
- large requests use all disks so transfer rate is high
- how redundant information is computed and distributed
- How to compute redundant information
- parity
- hamming code
- reed-solomon code
- how to distribute redundant information
- 2 patterns
- concentrate redundant information on small # of disks
- distribute the information across all disks
- desirable
- load-balancing problems avoided
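A minimal sketch of parity as the redundancy computation listed above: the parity unit is the bytewise XOR of the data units in a stripe, and any single lost unit can be rebuilt by XOR-ing the survivors (the 4-disk stripe below is an arbitrary example):

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # data units on 4 disks
parity = xor_blocks(stripe)                      # stored on a 5th unit

# simulate losing disk 2 and rebuilding it from the survivors + parity
survivors = [u for i, u in enumerate(stripe) if i != 2] + [parity]
assert xor_blocks(survivors) == stripe[2]
```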
Basic RAID Organizations

Nonredundant - RAID Level 0
- lowest cost
- no redundancy
- best write performance
- read performance is not the best
- a redundant array can do better by scheduling reads on the disk with the shortest expected seek and rotational delays
- single disk failure => data loss
- used in supercomputing environments where performance and capacity are preferred over reliability
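A small sketch of how RAID level 0 striping maps a logical block to a (disk, offset) pair; the round-robin layout by striping unit is the standard idea, and the parameter values are arbitrary:

```python
def raid0_locate(logical_block: int, n_disks: int, blocks_per_stripe_unit: int):
    """Return (disk index, block offset on that disk) for a logical block
    striped round-robin across n_disks."""
    stripe_unit = logical_block // blocks_per_stripe_unit
    disk = stripe_unit % n_disks
    stripe = stripe_unit // n_disks
    offset = stripe * blocks_per_stripe_unit + logical_block % blocks_per_stripe_unit
    return disk, offset

# 4 disks, 16-block striping unit: consecutive units land on consecutive disks
print(raid0_locate(0, 4, 16))    # (0, 0)
print(raid0_locate(16, 4, 16))   # (1, 0)
print(raid0_locate(70, 4, 16))   # (0, 22)
```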
Mirrored - Level 1
- mirroring or shadowing
- use 2x #disks
- writes go to both sets
- reads can be done from the copy with shorter queuing, seek, and rotational delays
- usage in database applications
Memory-Style ECC - Level 2
- much less cost than mirroring
- uses Hamming codes
- number of redundant disks is proportional to the log of the total number of disks
- when a disk fails, several of the parity components will have inconsistent values; the failed disk is the one held in common by every incorrect subset
- lost information is recovered by reading other components in subset including parity component
Bit interleaved parity - level 3
- single parity disk to recover lost information
- data is conceptually interleaved bit-wise over data disks
- and there is single parity disk
- a read request accesses all data disks
- write request accesses all data + parity disks
- so only 1 request can be serviced at a time
- parity disk contain no data so cannot participate in reads
- used for high bandwidth but not high I/O rates
Block interleaved parity - Level 4
- data is interleaved in blocks of arbitrary size
- size is called striping unit
- read requests smaller than the striping unit access only 1 data disk
- write requests must update both data and parity blocks
- if a write spans all disks, the new parity is just the XOR of all the new data
- a small write (on less than all disks) requires 4 disk I/Os
- read old data
- read old parity
- write new data
- write new parity
- parity disk can be a bottleneck since it is used in every write request
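A sketch of the small-write (read-modify-write) parity update described above: new parity = old parity XOR old data XOR new data, which is what accounts for the 4 disk I/Os. The block contents are arbitrary examples:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """The 4 I/Os of a block-interleaved small write:
    (1) read old data, (2) read old parity,
    then compute, (3) write new data, (4) write new parity."""
    new_parity = xor(xor(old_parity, old_data), new_data)
    return new_data, new_parity

# consistency check: rebuilding parity from scratch gives the same answer
d = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]          # 3 data blocks
p = xor(xor(d[0], d[1]), d[2])                        # full-stripe parity
new_d1, new_p = small_write(d[1], p, b"\x7f\x7f")     # overwrite block 1
assert new_p == xor(xor(d[0], new_d1), d[2])
```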
Block interleaved distributed parity - level 5
- distributes parity uniformly over all disks
- all disks participate in read
- has the best small read, large read, and large write performance
- small write reqs are inefficient due to the 4 disk I/Os reqd
- how parity is distributed affects performance
- the best studied placement is the left-symmetric layout (see the sketch below)
- universally preferred over level 4
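A sketch of the left-symmetric parity placement mentioned above, in its usual formulation: data block n lives on disk n mod N, and each stripe's parity rotates one disk to the left per stripe. Treat the exact formulas as a common reconstruction of the layout rather than a quote from the paper:

```python
def left_symmetric_layout(n_disks: int, n_stripes: int):
    """Return a table layout[stripe][disk] with 'P' for parity and the
    logical data block number otherwise."""
    layout = [[None] * n_disks for _ in range(n_stripes)]
    for stripe in range(n_stripes):
        layout[stripe][(n_disks - 1 - stripe) % n_disks] = "P"   # rotating parity
    block = 0
    while block < n_stripes * (n_disks - 1):
        stripe = block // (n_disks - 1)
        layout[stripe][block % n_disks] = block                   # data on disk n mod N
        block += 1
    return layout

for row in left_symmetric_layout(5, 5):
    print(row)
# [0, 1, 2, 3, 'P']
# [5, 6, 7, 'P', 4]
# [10, 11, 'P', 8, 9] ...
```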
P+Q redundancy - level 6
- uses Reed-Solomon codes to protect against 2 disk failures
- uses 2 redundant disks
- similar to block interleaved structurally
- small writes require 6 disk I/Os
Performance and Cost comparisons
- metrics to evaluate disk arrays
- reliability
- performance
- #io per second
- cost
- generally aggregate throughput is more important than response time on individual requests
- normalize metrics, i.e. instead of #io/second, take #io/second/dollar
- compare systems with equivalent file capacities (amount of information that can be stored apart from redundant information storage)

Reliability
- if only independent disk failures are considered, MTTF for RAID level 5 is MTTF_disk² / (N × (G − 1) × MTTR)
- MTTF and MTTR are mean time to failure and repair for one disk
- N is number of disks
- and G is parity group size
- 100 disks, each with MTTF = 200k hrs, and MTTR = 1hr, with G = 16 => MTTF = 3000yrs
- for a disk array with 2 redundant disks per parity group, such as P+Q, MTTF is MTTF_disk³ / (N × (G − 1) × (G − 2) × MTTR²)
- using the same values => ~38 million yrs
- factors like system crashes, uncorrectable bit errors, correlated disk failures can affect reliability a lot
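A quick sketch of the independent-failure MTTF estimate above for RAID level 5, using the example numbers quoted in these notes (the P+Q case is the analogous cubed form and lands in the tens of millions of years):

```python
HOURS_PER_YEAR = 24 * 365

def raid5_mttf_hours(disk_mttf: float, n_disks: int, group_size: int, mttr: float) -> float:
    """Independent-failure model: data is lost only if a second disk in the
    same parity group fails while the first failure is being repaired."""
    return disk_mttf ** 2 / (n_disks * (group_size - 1) * mttr)

# 100 disks, 200,000-hour MTTF, 1-hour MTTR, parity groups of 16
print(raid5_mttf_hours(200_000, 100, 16, 1) / HOURS_PER_YEAR)  # ~3,000 years
```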
System Crashes and Parity Inconsistency
- System crash = power failure or s/w crash or h/w breakdown etc
- can lead to inconsistent states
- parity updated, data not and vice versa
- use redundant h/w? even then, 100% reliability can't be achieved
- bit-interleaved and block-interleaved both have this prob, but the latter is of more concern
- because an interrupted write can corrupt the parity protecting data that wasn't being written
- so a system crash is equivalent to a disk failure
- also, system crashes occur more frequently than disk failures, and for P+Q arrays a crash is equivalent to 2 disk failures
- solution?
- logging of data to non-volatile storage before writing
Uncorrectable Bit Errors
- failures to correctly read or write small bits of data
- generated because data is incorrectly written or gradually damaged as magnetic media ages
- when reconstructing a failed disk by reading data from the non-failed disks, a bit error rate of 1 in 10^14 => success prob ≈ 99.2%, i.e. ~0.8% of reconstructions fail due to an uncorrectable bit error
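A sketch of where a figure like 99.2% comes from: the chance of reading b bits with no uncorrectable error is (1 − BER)^b. The total-bits-read value below is a hypothetical reconstruction size chosen to be in that ballpark, not a number from the paper:

```python
def reconstruction_success_prob(bit_error_rate: float, bits_read: int) -> float:
    """Probability that every bit read during reconstruction is correct,
    assuming independent bit errors."""
    return (1.0 - bit_error_rate) ** bits_read

# e.g. reading ~10^12 bits (~125 GB) from the surviving disks
print(reconstruction_success_prob(1e-14, 10**12))  # ~0.99
```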
Correlated Disk Failures
- environmental and manufacturing factors can frequently cause correlated disk failures
- earthquakes, power surges, power failures, etc.
- three modes of failure for RAID level 5
- double disk failure
- system crash followed by a disk failure
- disk failure followed by an uncorrectable bit error during reconstruction.
Failure Characteristics for Level 5 and P+Q RAID types


Implementation Considerations
- Disk array state consists of the data and parity information, and apart from that
- it also contains information like which disks have failed, how much of a failed disk has been reconstructed, and which sectors are currently being updated (this information is reqd when the system crashes)
- this information is known as metastate information
Avoiding Stale data
- need to maintain information about validity of each sector of data and parity in disk array
- when a disk fails, mark its logical sectors invalid (prevents reading corrupted data)
- when an invalid sector is reconstructed, mark it valid (ensures new writes update the information on this sector)
- maintain valid/invalid information as a bit vector
- updating valid/invalid metastate information to stable storage does not present significant performance overhead
Regenerating Parity after System Crash
- information about consistent/inconsistent parity sectors is required to be logged to a stable storage to avoid regeneration of all parity sectors in case of a failure
- before performing write, mark parity sectors as inconsistent
- and after bringing the system up from a crash, regenerate all parity sectors marked inconsistent
- and periodically re-mark sectors that are no longer being updated as consistent
- performance prob in S/W RAID systems since they do not have access to fast non volatile storage
- methods to implement
- mark the parity sector inconsistent before every write
- keep a MRU pool of inconsistent parity sectors in memory to avoid remarking of already marked inconsistent sectors
- this works if there is a large locality of reference
- in transaction processing, with a large # of random write reqs, use a group commit mechanism: a large # of parity sectors can be marked inconsistent with a single access to non-volatile storage
- improves throughput but results in higher latencies for some writes
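A small sketch of the marking scheme above: keep an MRU pool of already-marked parity sectors so repeated writes to hot sectors don't re-log, and batch new markings (group commit) into one stable-storage access. The class and method names are made up for illustration:

```python
from collections import OrderedDict

class ParityInconsistencyLog:
    """Track which parity sectors may be inconsistent after a crash."""
    def __init__(self, mru_capacity: int = 1024):
        self.mru = OrderedDict()        # recently marked-inconsistent sectors
        self.mru_capacity = mru_capacity
        self.stable_writes = 0          # accesses to non-volatile storage

    def mark_inconsistent(self, parity_sectors):
        """Call before performing the writes that touch these parity sectors."""
        to_log = [s for s in parity_sectors if s not in self.mru]
        if to_log:
            self.stable_writes += 1     # one group-committed log append
        for s in parity_sectors:
            self.mru[s] = True
            self.mru.move_to_end(s)
        while len(self.mru) > self.mru_capacity:
            self.mru.popitem(last=False)

log = ParityInconsistencyLog()
log.mark_inconsistent([7, 8, 9])   # one stable-storage access for the batch
log.mark_inconsistent([8])         # already in the MRU pool: no extra access
print(log.stable_writes)           # 1
```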
Operating with failed disks
- system crash in block-interleaved is similar to disk failure (it can result in loss of parity info)
- some logging is reqd on every write to prevent loss of information in case of system crash
- 2 ways to do so
- demand reconstruction
- requires stand-by spare disks
- an access to a parity stripe with an invalid sector triggers reconstruction of the appropriate data immediately onto a spare disk
- write ops then never deal with invalid sectors created by disk failures
- parity sparing
- no stand-by spares reqd
- requires additional metastate information
- before servicing a write request to a parity stripe with an invalid sector, the sector is reconstructed and relocated to overwrite its corresponding parity sector
- and the sector is marked as relocated
- when the failed disk is replaced
- relocated sector is copied to spare disk
- parity is regenerated
- sector is no longer marked as relocated
Advanced Topics
Improving Small Write Performance on RAID level 5
- small writes require 4 disk I/Os
- increases response time by factor of 2, and decreases throughput by factor of 4
- and, mirrored disk arrays just use 2 disk I/Os and hence throughput decreases by a factor of 2 only
- so applns like transaction processing can't work well with RAID level 5 as it is
- 3 techniques to improve small write performance
- buffering and caching
- floating parity
- parity logging
Buffering and caching
- write buffering/async writes
- acknowledge the user's writes before actually writing to disk
- store in buffer
- response time improves
- throughput can be improved as
- overwrites to same data can be done w/o actually going to disk
- disk scheduler gets a larger view of I/O requests and hence can schedule effectively
- under high load, write buffer fills earlier than it empties and response time will still be 4 times worse than RAID 0
- can also group sequential writes together => can turn small writes into larger writes
- Read Caching
- improve resp time and throughput while reading data
- assumes temporal locality
- if old data is cached, write need only 3 disk I/Os now
- if recently written parity is also cached, only 2 disk I/Os now
- parity is computed over logically consecutive sectors, so caching it exploits both temporal and spatial locality
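A tiny sketch of the I/O counting above for a RAID level 5 small write, as a function of what the controller already has cached (purely illustrative bookkeeping):

```python
def small_write_disk_ios(old_data_cached: bool, old_parity_cached: bool) -> int:
    """Disk I/Os for a block-interleaved-parity small write:
    read old data + read old parity + write new data + write new parity,
    minus any reads that hit in the cache."""
    ios = 4
    if old_data_cached:
        ios -= 1
    if old_parity_cached:
        ios -= 1
    return ios

print(small_write_disk_ios(False, False))  # 4
print(small_write_disk_ios(True, False))   # 3
print(small_write_disk_ios(True, True))    # 2
```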
Parity Logging
- delay read of old parity and write of new parity
- instead of updating parity immediately, write the difference b/w old and new parity, known as the parity update image, to a log
- store the parity update image temporarily in non-volatile memory
- when this memory fills up, write the update images to a log region on disk
- when the log fills up, read the old parity and the logged update images into memory, XOR them, and write the resulting parity back to disk
- more data is transferred to and from disk but larger accesses are efficient than smaller ones
- small write overhead reduces from 4 to little more than 2 disk I/Os
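A minimal sketch of the parity-logging flow above: each small write appends an update image (old data XOR new data) to a log instead of touching parity, and the log is later folded into the on-disk parity in one large pass. The data structures and names are illustrative:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

parity = {0: b"\x00\x00"}    # on-disk parity, one stripe shown
log = []                      # parity update images (NVRAM, then a log region on disk)

def small_write(stripe: int, old_data: bytes, new_data: bytes) -> None:
    """Log the update image instead of reading/writing parity now."""
    log.append((stripe, xor(old_data, new_data)))

def apply_log() -> None:
    """When the log fills: fold every update image into the old parity
    using large, efficient accesses."""
    for stripe, image in log:
        parity[stripe] = xor(parity[stripe], image)
    log.clear()

small_write(0, old_data=b"\x11\x11", new_data=b"\x22\x22")
apply_log()
print(parity[0])   # b'\x33\x33' == old parity XOR (old data XOR new data)
```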
Declustered Parity
- disk failures can cause performance degradations in standard RAID level 5 disk arrays
- a workload of small reads will double the effective load at the non-failed disks due to the extra accesses needed to reconstruct data for reads to the failed disk
- striping data across several smaller RAIDs reduces the average increase in load compared to one RAID with a single large parity group, but the RAID containing the failed disk still sees a ~100% increase in load in the worst case
- prob is that although inter-RAID striping distributes load uniformly when no disk has failed, it distributes the increased load from a failed disk nonuniformly
- the small set of disks in the same parity group as the failed disk bears the entire weight of the increased load
- declustered-parity RAID solves this prob by distributing load uniformly over all disks

- Disadvantages?
- less reliable
- any 2 disks failures will result in data loss since each pair of disks has a parity gp in common
- complex parity gps could disrupt sequential placement of data across the disks
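A back-of-the-envelope sketch of the load argument above. With an ordinary parity group of size G, each read to the failed disk turns into G − 1 reads on the other disks of that group (~100% extra load on them); declustering spreads those reconstruction reads over all C − 1 surviving disks of a larger array, so the per-disk increase drops to roughly (G − 1)/(C − 1). This ratio is the standard declustering argument, stated here from general knowledge rather than quoted from the paper:

```python
def extra_load_fraction(group_size: int, array_disks: int) -> float:
    """Approximate fractional load increase on each surviving disk when
    reads to one failed disk are reconstructed on the fly."""
    return (group_size - 1) / (array_disks - 1)

print(extra_load_fraction(group_size=5, array_disks=5))    # 1.0 -> ~100% extra load
print(extra_load_fraction(group_size=5, array_disks=21))   # 0.2 -> ~20% extra load
```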
Exploiting on-line spare disks
- on-line spare disks allow reconstruction of failed disks to start immediately reducing the window of vulnerability during which an additional disk failure would result in data loss
- but they are idle most of time and do not contribute to normal operations
- 2 techniques can be used to exploit them to enhance performance during normal ops
- distributed sparing

- distribute capacity of spare disk across all disks in array
- instead of N data disks and 1 spare disk, you have N+1 data disks, each with 1/(N+1)th spare capacity
- when a disk fails, its blocks are reconstructed on the corresponding spare blocks
- disadvantages?
- reconstructed data must eventually be copied onto a permanent replacement of the failed disk, but the copying can be done leisurely and hence doesn't affect performance much
- reconstructed data is distributed across many disks whereas it formerly lived on a single disk, so the original data placement is disturbed
- parity sparing

- it uses spare capacity to store parity information
- second parity can be used to partition disk array logically into two separate disk arrays
- and it can be used to implement P+Q redundancy
Data Striping in Disk Arrays
- deciding the striping unit size requires balancing 2 conflicting goals
- maximizing amount of data a disk transfers with each logical I/O
- positioning time is wasted work so maximizing amount of useful work done b/w two positionings is beneficial
- utilizing all disks in an I/O
- Idle times are also wasted time
- ensuring this means data is spread widely among more disks and hence each disk transfers less data in one logical I/O
- finding balance between two is tradeoff in deciding data distribution and is workload dependent
- If the concurrency of the workload is known, there is a formula for a striping unit providing 95% of the maximum throughput possible for any particular request distribution (see the sketch at the end of this section)
- striping unit ≈ 1/4 × avg positioning time × data transfer rate × (concurrency − 1) + 1 sector
- avg positioning time = avg seek time + avg rotational delay
- small concurrency gives a smaller striping unit, allowing each access to utilize all disks
- large concurrency gives a larger striping unit so that more independent accesses can be serviced in parallel
- if nothing is known about concurrency
- striping unit ≈ 2/3 × avg positioning time × data transfer rate
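A sketch plugging numbers into the sizing rules above; the coefficients follow the rules as stated in these notes and the disk parameters are illustrative, so treat the exact byte counts as an assumption rather than a result from the paper:

```python
from typing import Optional

SECTOR_BYTES = 512

def striping_unit_bytes(avg_positioning_ms: float,
                        transfer_rate_mb_s: float,
                        concurrency: Optional[int]) -> float:
    """Rule-of-thumb striping unit size in bytes."""
    bytes_per_ms = transfer_rate_mb_s * 1e6 / 1000
    if concurrency is None:
        # nothing known about concurrency: fixed compromise
        return (2 / 3) * avg_positioning_ms * bytes_per_ms
    return 0.25 * avg_positioning_ms * bytes_per_ms * (concurrency - 1) + SECTOR_BYTES

# e.g. 16 ms avg positioning time, 3 MB/s transfer rate
print(striping_unit_bytes(16, 3, concurrency=1))    # 512 bytes: stripe finely
print(striping_unit_bytes(16, 3, concurrency=9))    # ~96 KB: larger units
print(striping_unit_bytes(16, 3, concurrency=None)) # ~32 KB compromise
```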