Sweep Away the Garbage

for scalable, fault-tolerant shared VM storage

Adam Litke - alitke@redhat.com
FOSDEM 2016 - 30 January 2016

The next 40 minutes

  • oVirt shared storage architecture
  • Preventing data corruption
  • Recovering from failure
  • Examples

Local VM storage

Multi-host local VM storage

Shared VM storage

oVirt shared storage

oVirt storage domain

oVirt image

oVirt volume

Storage operations

Datapath operations

  • A VM or host accessing volume contents
  • These are the most common and most important
  • Lots of IO
  • Long running
  • Narrow in scope

Example: VM volume access

Example: Host volume access

Metadata operations

  • Adding / removing / rearranging storage objects
  • Changing storage domain metadata
  • Minimal IO
  • Short running
  • Can have broad scope

Example: create volume

Example: delete image

Challenge: conflicts

Preventing conflicts

  • Requirement: data integrity
  • Goal: maximize concurrency
  • Interaction between storage objects is complex
  • Orchestration required across several domains
      User actions
      Hosts
      Local threads

Same VM on multiple hosts

Conflicting metadata updates

Run VM during snapshot

Solution: Locking

Management level locking

  • Entities are locked while executing user-driven actions
    • Lock an image during creation
    • Lock a VM while taking a snapshot
    • Lock a host while it modifies storage

Shared storage locking

  • Implemented using Sanlock
  • Lockspace is on shared storage
  • Leases grant hosts exclusive access to storage resources
    • Storage domain lease: needed for metadata changes
    • Volume lease: protects volume contents

More about Sanlock

  • Host IDs
    • Every host has a unique ID
    • Uniqueness is enforced by Sanlock
    • IDs must be periodically renewed
    • A host that fails to renew its ID surrenders all of its resource leases
  • Resource leases
    • Represent an arbitrary resource (storage or otherwise)
  • Misbehaving hosts will be fenced (rebooted)
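
The renewal rule above can be illustrated with a toy, in-memory model. This is only a sketch and not the Sanlock API: the class HostIdTable, its methods, and the explicit "now" timestamps are all hypothetical, and real leases live on shared storage rather than in a Python dict.

```python
class HostIdTable:
    """Toy in-memory model of host ID renewal (hypothetical, not the Sanlock API)."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds a host ID stays valid without renewal
        self.renewed = {}        # host_id -> time of last renewal
        self.leases = {}         # resource name -> host_id holding the lease

    def renew(self, host_id, now):
        """Record that host_id renewed its ID at time 'now'."""
        self.renewed[host_id] = now

    def is_alive(self, host_id, now):
        """A host ID is valid only if it was renewed within the timeout."""
        last = self.renewed.get(host_id)
        return last is not None and now - last <= self.timeout

    def acquire(self, resource, host_id, now):
        """Try to take an exclusive lease; returns True on success."""
        if not self.is_alive(host_id, now):
            raise RuntimeError("host ID is not valid, renew it first")
        owner = self.leases.get(resource)
        if owner is not None and owner != host_id and self.is_alive(owner, now):
            return False         # another live host holds the lease
        # Either the resource is free or the previous owner stopped renewing,
        # which in this model means it has surrendered all of its leases.
        self.leases[resource] = host_id
        return True
```

In this model a second host can take over a volume lease only once the first host has missed its renewal window, which mirrors the "failure to renew" bullet.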

Process level locking

  • Implemented with a local lock manager and RWLocks
  • Locks grant threads either shared or exclusive access
    • Storage domain lock: protects metadata
    • Image lock: protects volume chain and metadata
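
A minimal sketch of a readers-writer lock granting shared or exclusive access to threads. Python's standard library has no RWLock, so the class below is an illustrative implementation built on threading.Condition. It is not the lock manager used by oVirt, and it makes no attempt at writer fairness.

```python
import threading
from contextlib import contextmanager


class RWLock:
    """Simple readers-writer lock: many shared holders or one exclusive holder."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of threads holding shared access
        self._writer = False    # True while a thread holds exclusive access

    @contextmanager
    def shared(self):
        with self._cond:
            while self._writer:             # wait for the writer to finish
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()     # a waiting writer may proceed

    @contextmanager
    def exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()     # wake waiting readers and writers
```

A thread that only reads image metadata would use "with lock.shared():", while a thread that rearranges the volume chain would use "with lock.exclusive():".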

Challenge: interruptions

Handling interruptions

  • An interruption means some steps in a task are never completed
  • Interruptions happen naturally or due to bugs
    • Power or network outage
    • Hardware failure
    • Software failure
  • Must be carefully mitigated to keep storage coherent
  • Approaches
    • Storage task manager with rollback capability
    • Storage transactions with garbage collection

Interrupted volume creation

Interrupted volume copy

Solution: Transactional Storage

  • Storage transactions
  • Garbage collection
  • Monitoring and resolution

Storage transactions

  • Each storage command must be a single transaction
  • A transaction is opened with a marker operation
  • Subsequent steps accumulate "garbage" on storage
  • A transaction is committed by converting the start marker
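
The marker idea can be sketched on a plain file system. The ".volatile" suffix and the helper names begin and commit are assumptions made for illustration; the talk does not specify how markers are encoded. The sketch opens a transaction by creating a marker directory, lets later steps accumulate content inside it, and commits with a single rename.

```python
import os
import tempfile

VOLATILE_SUFFIX = ".volatile"   # hypothetical marker naming, not from the talk


def begin(path):
    """Open a transaction by creating a marker directory for 'path'."""
    marker = path + VOLATILE_SUFFIX
    os.mkdir(marker)
    return marker


def commit(marker):
    """Commit by converting the marker to its final name with one rename."""
    final = marker[:-len(VOLATILE_SUFFIX)]
    os.rename(marker, final)
    return final


def demo():
    root = tempfile.mkdtemp()
    marker = begin(os.path.join(root, "image-1"))
    # Everything written under the marker is garbage until the commit.
    with open(os.path.join(marker, "volume.data"), "w") as f:
        f.write("payload")
    before = sorted(os.listdir(root))
    commit(marker)
    after = sorted(os.listdir(root))
    return before, after
```

If the process dies anywhere before commit, only the marker directory is left behind, which is exactly what a garbage collector can look for.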

Example

Garbage collection

  • Runs periodically on an arbitrary host
  • Identifies candidates by finding markers
  • Acquires necessary locks for the candidate
  • Verifies the candidate should be collected
  • Cleans garbage associated with the marker
  • Removes the marker
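
The collection steps above can be sketched as one pass over a directory. As an assumption for illustration, markers are directories ending in ".volatile" and a single threading.Lock stands in for the locks a real collector would take per candidate. Because the marker is the directory itself here, cleaning the garbage and removing the marker collapse into one rmtree call.

```python
import os
import shutil
import tempfile
import threading

VOLATILE_SUFFIX = ".volatile"   # hypothetical marker naming, not from the talk


def collect(root, lock):
    """Run one collection pass over 'root'; return the names that were removed."""
    removed = []
    # Identify candidates by finding markers.
    candidates = [n for n in sorted(os.listdir(root))
                  if n.endswith(VOLATILE_SUFFIX)]
    for name in candidates:
        # Acquire the lock; if it is busy the owner may still be working,
        # so abort this candidate and leave it for a later pass.
        if not lock.acquire(blocking=False):
            continue
        try:
            path = os.path.join(root, name)
            # Verify: the marker may have been committed or collected meanwhile.
            if not os.path.isdir(path):
                continue
            # Clean the garbage and remove the marker in one step.
            shutil.rmtree(path)
            removed.append(name)
        finally:
            lock.release()
    return removed


def demo():
    root = tempfile.mkdtemp()
    os.mkdir(os.path.join(root, "img-a" + VOLATILE_SUFFIX))  # interrupted transaction
    os.mkdir(os.path.join(root, "img-b"))                    # committed image
    removed = collect(root, threading.Lock())
    return removed, sorted(os.listdir(root))
```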

Identify candidate

Acquire locks

Clean

Remove marker

Identify candidate

Acquire locks

Abort

Monitoring and resolution

  • Running commands raise events or can be polled
    • Progress
    • State changes
    • Error code and context
  • Command results are not persistent
  • Success or failure is evident from examining storage

Practical examples

  • Create volume
  • Remove volume
  • Clone volume

Example: Create volume

Acquire domain lease

Acquire image lock

Create volatile image directory

Create volatile metadata file

Create lease file

Create volume data file

Commit metadata file

Commit image directory

Release image lock

Release domain lease
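
The create volume sequence above might look roughly like this on a file-based domain. This is a sketch under several assumptions: the ".volatile" suffix, the ".meta" and ".lease" file names, and the metadata contents are invented for illustration, and two threading.Lock objects stand in for the domain lease and the image lock.

```python
import os
import tempfile
import threading
from contextlib import ExitStack

VOLATILE = ".volatile"   # hypothetical marker suffix, not from the talk


def create_volume(domain_dir, image, volume, domain_lease, image_lock):
    # ExitStack releases in reverse order: image lock first, then domain lease.
    with ExitStack() as stack:
        stack.enter_context(domain_lease)       # acquire domain lease (stand-in)
        stack.enter_context(image_lock)         # acquire image lock (stand-in)

        image_dir = os.path.join(domain_dir, image)
        tmp_dir = image_dir + VOLATILE
        os.mkdir(tmp_dir)                       # create volatile image directory

        meta = os.path.join(tmp_dir, volume + ".meta")
        with open(meta + VOLATILE, "w") as f:   # create volatile metadata file
            f.write("format=raw\n")
        open(os.path.join(tmp_dir, volume + ".lease"), "w").close()  # lease file
        open(os.path.join(tmp_dir, volume), "w").close()             # data file

        os.rename(meta + VOLATILE, meta)        # commit metadata file
        os.rename(tmp_dir, image_dir)           # commit image directory
    return image_dir


def demo():
    domain = tempfile.mkdtemp()
    path = create_volume(domain, "img", "vol",
                         threading.Lock(), threading.Lock())
    return sorted(os.listdir(domain)), sorted(os.listdir(path))
```

An interruption at any point before the last rename leaves a directory that still carries the volatile suffix, so the partial image is never mistaken for a committed one.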

Example: Remove volume

Existing volume

Acquire domain lease

Acquire image lock

Make image volatile

Invoke the garbage collector

Release image lock

Release domain lease
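
Removal can reuse the same machinery: the image is first made volatile and the collector does the actual deletion. The sketch below assumes a hypothetical ".volatile" directory suffix and uses threading.Lock objects as stand-ins for the domain lease and image lock; the helper names are illustrative only.

```python
import os
import shutil
import tempfile
import threading

VOLATILE = ".volatile"   # hypothetical marker suffix, not from the talk


def collect_garbage(domain_dir):
    """Delete every volatile directory found in the domain."""
    for name in os.listdir(domain_dir):
        if name.endswith(VOLATILE):
            shutil.rmtree(os.path.join(domain_dir, name))


def remove_volume(domain_dir, image, domain_lease, image_lock):
    # Locks are released in reverse order when the with block exits.
    with domain_lease, image_lock:
        image_dir = os.path.join(domain_dir, image)
        os.rename(image_dir, image_dir + VOLATILE)   # make image volatile
        collect_garbage(domain_dir)                  # invoke the garbage collector


def demo():
    domain = tempfile.mkdtemp()
    os.mkdir(os.path.join(domain, "img"))
    open(os.path.join(domain, "img", "vol"), "w").close()
    remove_volume(domain, "img", threading.Lock(), threading.Lock())
    return os.listdir(domain)
```

If the host fails right after the rename, the image is already marked as garbage, so a later periodic collection pass can finish the removal.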

Example: Clone volume

Existing volume

Create another volume

Acquire source image lock

Acquire target image lock

Acquire source volume lease

Acquire target volume lease

Mark target volume illegal

Copy data

Progress event

Mark target volume legal

Release target volume lease

Release source volume lease

Release target image lock

Release source image lock

Completion event
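
A sketch of the clone sequence above, using in-memory file objects. The metadata dict with a "legal" key, the emit callback, and the chunk size are assumptions for illustration, and the four locks are plain threading.Lock stand-ins passed in the order shown on the slides.

```python
import io
import threading
from contextlib import ExitStack


def clone_volume(src, dst, dst_meta, locks, emit, chunk_size=4):
    """Copy src into dst, keeping the target illegal until the copy completes."""
    with ExitStack() as stack:
        for lock in locks:                  # acquire in the given order
            stack.enter_context(lock)       # released in reverse order on exit
        dst_meta["legal"] = False           # mark target volume illegal
        total = src.seek(0, io.SEEK_END)    # size of the source in bytes
        src.seek(0)
        copied = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)                # copy data
            copied += len(chunk)
            emit("progress", copied, total) # progress event
        dst_meta["legal"] = True            # mark target volume legal
    emit("complete", copied, total)         # completion event, after release


def demo():
    events = []
    meta = {}
    dst = io.BytesIO()
    locks = [threading.Lock() for _ in range(4)]
    clone_volume(io.BytesIO(b"0123456789"), dst, meta, locks,
                 lambda *event: events.append(event))
    return dst.getvalue(), meta, events
```

If the copy is interrupted, the target stays marked illegal, so its state can be determined by examining storage rather than by a persisted command result.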

Locking order

  • Strict rules needed to prevent deadlock
  • Storage leases before local locks
  • Big containers before smaller containers
    • Storage Domain ➡ Image ➡ Volume
  • Source volume before destination volume
  • Release the newest locks first
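
One way to enforce such rules is to sort lock requests by a fixed key before acquiring them and to release them in reverse. The ranking below is an assumption: it orders by container size first, then leases before local locks, then source before destination, which reproduces the acquisition order of the clone example. The Tracked class and all names are hypothetical helpers used to make the order visible.

```python
import threading
from contextlib import ExitStack

# Lower rank is acquired first (assumed combination of the rules above).
SCOPE_RANK = {"domain": 0, "image": 1, "volume": 2}
KIND_RANK = {"lease": 0, "lock": 1}
ROLE_RANK = {"source": 0, "destination": 1}


def sort_key(spec):
    kind, scope, role = spec
    return (SCOPE_RANK[scope], KIND_RANK[kind], ROLE_RANK[role])


def acquire_in_order(requests):
    """requests: list of ((kind, scope, role), lock). Returns an ExitStack."""
    stack = ExitStack()
    try:
        for _spec, lock in sorted(requests, key=lambda r: sort_key(r[0])):
            stack.enter_context(lock)
    except BaseException:
        stack.close()        # release whatever was already acquired
        raise
    return stack             # closing it releases the newest locks first


class Tracked:
    """Lock wrapper that logs acquire and release, for demonstration only."""

    def __init__(self, name, log):
        self.name, self.log = name, log
        self._lock = threading.Lock()

    def __enter__(self):
        self._lock.acquire()
        self.log.append("acquire " + self.name)
        return self

    def __exit__(self, *exc):
        self.log.append("release " + self.name)
        self._lock.release()
        return False


def demo():
    log = []
    requests = [   # deliberately listed out of order
        (("lease", "volume", "destination"), Tracked("dst-volume", log)),
        (("lock", "image", "source"), Tracked("src-image", log)),
        (("lease", "volume", "source"), Tracked("src-volume", log)),
        (("lock", "image", "destination"), Tracked("dst-image", log)),
    ]
    with acquire_in_order(requests):
        pass
    return log
```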

Join us!

Questions?