CSE 7343/5343, Spring 2003
Topic 22: Distributed Systems/OS and Distributed Coordination
Prof. Jeff Tian, CSE/SEAS/SMU, Dallas, TX 75275
tian@engr.smu.edu; www.engr.smu.edu/~tian/class/7343.03s
- Dates: 4/10-4/17
- Reading: Ch15, Ch17.
Distributed Systems
- single CPU we dealt with so far
- distributed systems:
- multiple processors
- loosely-coupled
no shared memory or clock
- connected via computer network
- communicate through message passing
- example: Fig 15.1 p.540
- comparison to other systems
- single CPU vs. multi-processors
- distributed system vs. tightly coupled systems
- local vs. remote resources
- classical classification:
SISD, SIMD, MISD, MIMD
based on instruction (I) and data (D) streams
single (S) or multiple (M) streams
- related: parallel computing
computation vs resource based classification
- advantages of distributed systems
- resource sharing
- computation speed up
- reliability
- communication
- advantage: compare to what?
- single CPU systems
- tightly coupled systems
- isolated multi-processor systems
- requires much OS and network support
Distributed OS
- OS for DS:
- need to support local as well as system wide operations
- local OS: similar to what we discussed so far
- new: concurrency control, event ordering, coordination, etc.
- our focus: coordination problems
- distributed OS (DOS) types
- two types, depending on services provided
- network OS's:
individual OS's with communication and other (limited) OS support
- (fully) distributed OS's:
more OS support
- network OS services:
- remote login (telnet)
- remote file transfer (ftp)
- => realize (some) distributed system advantages
- top network commands: mail, telnet, ftp
- distributed OS services:
- data/computation/process migration
- (fully) realize distributed system advantages
Network and communication protocols
- closely related to DS/DOS: computer networks
- topology
- distance
- logical connection
- network (DS) topology
- fully connected vs. partially connected
- specific types of partially connected
star
tree
ring
- illustrations: Fig 15.2 p.547
- advantages and disadvantages
- connections/cost
- reliability etc.
how much of the DS advantages can be realized
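The connection/cost tradeoff above can be made concrete by counting the links needed for n sites. This is only a minimal sketch: `link_count` is a hypothetical helper, not something from the text, and it assumes n >= 3 and that the hub of the star is one of the n sites.

```python
def link_count(topology: str, n: int) -> int:
    """Number of direct links needed to connect n sites (assumes n >= 3)."""
    if topology == "full":
        # every pair of sites has its own link
        return n * (n - 1) // 2
    if topology in ("star", "tree"):
        # both are acyclic and connected: one link per non-root site
        return n - 1
    if topology == "ring":
        # each site links to its successor, the last one back to the first
        return n
    raise ValueError(f"unknown topology: {topology}")


for name in ("full", "star", "tree", "ring"):
    print(name, link_count(name, 6))
```

Fewer links means lower cost but also fewer alternative paths, which is the reliability side of the same tradeoff.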
- the effect of physical distance
- network types
- LAN (Fig 15.3, p.550)
- WAN (Fig 15.4, p.551)
- logical and standards: layered structure
- ISO/OSI 7-layered network
- lower 3: physical, data link, network
- upper 4: transport, session, presentation, application
- distributed communication over networks
- many issues to address:
- naming
- routing
- connection
- contention
- possible failures
- fault tolerance and other design issues
- much more material => additional courses
- similar issue for other distributed services
Distributed coordination
- focus: mutual exclusion and deadlocks
- need: event ordering and election algorithm
- mutual exclusion problem in DS
- same problem, but harder to solve
- why harder? (previous solutions)
- software solution: rely on shared resources
- hardware solution: single point of control
- distributed semaphores/monitors hard to implement
- solution approaches
- centralized (similar to local OS)
- token-based
- distributed algorithms
- need event ordering or election algorithm
- needed for solving other coordination problems too
Event ordering
- event ordering difficult, because
- no single point of reference
(no shared memory or clock)
- true parallelism and concurrency
- but we need some order to coordinate processes/etc.
- event ordering important, because
- to coordinate processes/etc.
- mutual exclusion
- application in deadlocks
- other distributed OS services
- possibility for event ordering
- local events: in execution order
- messages
- basis for event ordering
- event ordering and happened-before relation
- possibilities above and logical extensions
- local: use execution order
- message: send before receive
- transitivity: A -> B and B -> C, then A -> C
- happened-before provides a partial order
- holds: ordered
- does not hold: concurrent
- examples (Fig 17.1, p.597)
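The three rules above (local order, send before receive, transitivity) can be sketched as reachability in a graph of events. The event names and the single message in the example are made up for illustration and are not those of Fig 17.1; the function names are hypothetical as well.

```python
def happened_before_edges(local_orders, messages):
    """Direct edges: consecutive local events, plus send -> receive."""
    edges = {}
    for events in local_orders.values():
        for a, b in zip(events, events[1:]):
            edges.setdefault(a, set()).add(b)
    for send, recv in messages:
        edges.setdefault(send, set()).add(recv)
    return edges


def happened_before(edges, a, b):
    """True if a -> b follows from the direct edges by transitivity."""
    stack, seen = [a], set()
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt == b:
                return True
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def concurrent(edges, a, b):
    """Two distinct events are concurrent if neither happened before the other."""
    return a != b and not happened_before(edges, a, b) and not happened_before(edges, b, a)


# Hypothetical run: two processes, one message from p1 to q1.
edges = happened_before_edges(
    {"P": ["p0", "p1", "p2"], "Q": ["q0", "q1", "q2"]},
    [("p1", "q1")],
)
print(happened_before(edges, "p0", "q2"))  # ordered through the message
print(concurrent(edges, "q0", "p2"))       # no chain in either direction
```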
- (global) time stamp: logical clock based on
- local physical clock, and
- happened-before relation
- implementation:
- local, advance with physical clock
- communication: timestamped message
may advance with received message
take the max, then increment
- result: total order (ties broken by process ID)
- application in many distributed coordination problems
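A minimal sketch of the logical clock rules above. It assumes a simple counter in place of the local physical clock, and the class and method names are hypothetical. Ties between equal timestamps are broken by process ID, which is what turns the partial order into a total one.

```python
class LogicalClock:
    """Logical clock for one process (a counter stands in for the physical clock)."""

    def __init__(self, pid):
        self.pid = pid
        self.time = 0

    def tick(self):
        """Local event: advance the clock."""
        self.time += 1
        return self.time

    def send(self):
        """Timestamp to attach to an outgoing message."""
        return self.tick()

    def receive(self, msg_time):
        """Take the max of local and message time, then increment."""
        self.time = max(self.time, msg_time) + 1
        return self.time

    def stamp(self):
        """Globally comparable timestamp: (time, pid) compares as a tuple."""
        return (self.time, self.pid)


p, q = LogicalClock(1), LogicalClock(2)
t = p.send()    # p stamps an outgoing message
q.receive(t)    # q moves past the message's timestamp
print(p.stamp() < q.stamp())
```

With this rule, a receive always carries a larger timestamp than the matching send, so the timestamps never contradict happened-before.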
Mutual exclusion in DS: Concept and algorithms
- same problem, but harder to solve
- solution approaches
- centralized (similar to local OS)
- token-based
- distributed algorithms
- ME in DS: distributed algorithms
- timestamp (TS) based solution:
- in CS, defer reply
- not trying to enter CS, reply immediately
- competing to enter CS, compare TS values, and enter in FIFO order
- algorithm on p.599
- key: timestamp!
- how to maintain?
- satisfies solution requirements
M.E., progress, bounded waiting
- messages needed: 2(N-1) per CS entry
- other improvement: message reduction
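The three reply rules above can be sketched as a small single-threaded simulation. This is not the algorithm text from p.599, only an illustration under stated assumptions: a shared FIFO queue stands in for the network, no messages are lost, and the names `Site` and `run` are hypothetical.

```python
from collections import deque


class Site:
    def __init__(self, pid, peers, queue):
        self.pid, self.peers, self.queue = pid, peers, queue
        self.clock = 0
        self.state = "idle"          # idle, wanting, or in_cs
        self.my_request = None       # (timestamp, pid) of our pending request
        self.pending_replies = 0
        self.deferred = []           # sites whose requests we answer on release

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.pending_replies = len(self.peers)
        self.state = "wanting" if self.peers else "in_cs"
        for p in self.peers:
            self.queue.append((p, "request", self.my_request))

    def release_cs(self):
        self.state = "idle"
        for p in self.deferred:
            self.queue.append((p, "reply", (self.clock, self.pid)))
        self.deferred = []

    def handle(self, kind, stamp):
        ts, sender = stamp
        self.clock = max(self.clock, ts) + 1
        if kind == "reply":
            self.pending_replies -= 1
            if self.pending_replies == 0:
                self.state = "in_cs"
        elif self.state == "in_cs" or (
            self.state == "wanting" and self.my_request < stamp
        ):
            # in CS, or competing with an older request of our own: defer
            self.deferred.append(sender)
        else:
            # not interested, or the other request is older: reply at once
            self.queue.append((sender, "reply", (self.clock, self.pid)))


def run(sites, queue):
    """Deliver queued messages in FIFO order; return how many were delivered."""
    delivered = 0
    while queue:
        dest, kind, stamp = queue.popleft()
        sites[dest].handle(kind, stamp)
        delivered += 1
    return delivered


queue = deque()
ids = (1, 2, 3)
sites = {i: Site(i, [p for p in ids if p != i], queue) for i in ids}
sites[1].request_cs()
sites[2].request_cs()           # both compete with timestamp 1
run(sites, queue)
print(sites[1].state, sites[2].state)
```

In this run sites 1 and 2 request with equal timestamps, so the process ID breaks the tie and site 1 enters first; site 2 enters only after site 1 releases and sends its deferred reply.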
ME in DS: Other solutions/algorithms
- ME in DS: token-based solution
- (logical) token ring
- with token, enter C.S., otherwise wait
- consequences of the scheme:
- prolonged waiting
- token problem
- network problem
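A rough sketch of the token ring scheme, assuming a reliable ring with no lost token (so it illustrates only the prolonged-waiting consequence, not the token or network problems). `circulate` and its parameters are hypothetical names.

```python
def circulate(n_sites, wanting, holder=0):
    """Pass the token around a logical ring until every waiting site has
    entered its critical section. Returns (order of entry, token hops)."""
    wanting = set(wanting)
    if not wanting <= set(range(n_sites)):
        raise ValueError("unknown site id in wanting")
    order, hops = [], 0
    while wanting:
        if holder in wanting:
            # the site holding the token may enter its C.S.
            order.append(holder)
            wanting.remove(holder)
        if wanting:
            # pass the token to the next site on the ring
            holder = (holder + 1) % n_sites
            hops += 1
    return order, hops


print(circulate(5, {1, 3}, holder=2))
```

Here site 1 has to wait for the token to travel past sites 3, 4, and 0, even though none of sites 4 and 0 wants the critical section.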
- ME in DS: centralized solution
- one processor acts for the whole system
- use existing solutions
- problem: single point of failure
- alleviate: election algorithm
- solve: distributed or other approaches
- give up DS advantages?
- election algorithms
- many situations where it is useful
- normal: algorithm for coordination problems
- failure: backup and fault-tolerance issues
- bully and ring algorithms in book: based on pre-determined priority
- other algorithms possible
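The outcome of a priority-based election can be sketched as follows. This is a simplification in the spirit of the bully algorithm, not the book's full protocol: it assumes a larger ID means higher priority, it ignores timeouts and message counts, and it follows only one of the higher-priority sites that would answer. The function name is hypothetical.

```python
def bully_election(initiator, alive):
    """Return the coordinator chosen when `initiator` starts an election.

    alive: set of site IDs currently up; larger ID = higher priority.
    """
    if initiator not in alive:
        raise ValueError("a failed site cannot start an election")
    higher = sorted(p for p in alive if p > initiator)
    if not higher:
        # nobody with higher priority answers: the initiator wins
        return initiator
    # a live higher-priority site answers and takes over the election
    return bully_election(higher[0], alive)


# Sites 4 and 6 have failed; site 2 notices and starts an election.
print(bully_election(2, {1, 2, 3, 5}))
```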
Deadlocks in DS
- similar to previous deadlock handling
- distributed modifications/extensions
- local deadlock => deadlock in the system
- no local deadlock, global deadlock still possible
- local RAG and global RAG
- global information available:
=> may use existing algorithms
- distributed extensions
- generic techniques:
(still) prevention, avoidance, detection&recovery, do nothing
- do nothing
- mostly networked OS
(local OS + communication)
limited coordination
- some fully distributed OS as well
- when deadlock probability low
- associated cost
- cost-benefit analysis
- deadlock prevention
- circular wait: modifications
- other three conditions:
similar to before
no extra information needed
- global resource ordering to break circular wait
- dedicated site to maintain RAG
(centralized solution)
- drawback to above scheme
- detection and recovery
- modified detection algorithms: later
- rollback/recovery scheme
- presented in prevention
- priority scheme to determine rollback
- possibility of starvation
- timestamp-based priority schemes:
- termination (wait-die) or
- rollback (wound-wait)
- again: based on event ordering
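The two timestamp-based schemes can be sketched as decision rules applied when one process requests a resource held by another. The sketch assumes a smaller timestamp means an older process; the function names and return strings are hypothetical. Starvation is usually avoided by letting a rolled-back process keep its original timestamp when it restarts.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits, a younger one dies (is rolled back)."""
    return "wait" if requester_ts < holder_ts else "die"


def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds (rolls back) the holder,
    a younger requester waits."""
    return "wound holder" if requester_ts < holder_ts else "wait"


# Process with timestamp 5 is older than the one with timestamp 10.
print(wait_die(5, 10), "/", wait_die(10, 5))
print(wound_wait(5, 10), "/", wound_wait(10, 5))
```

In both schemes waits run in one direction only (old to young in wait-die, young to old in wound-wait), so no cycle of waiting processes can form.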
- deadlock avoidance: not used as often
- why?
- difficulty with resource/allocation tracking
- difficulty with safety algorithms
related: frequent running of safety algorithms
- see detection algorithm later
Deadlock modeling/detection in DS
- deadlock modeling:
- local and global RAG and wait-for graphs
- deadlock detection:
- centralized solution:
- fully distributed deadlock detection
(with or) without exact information
- centralized solution:
- maintain global information at a site
- responsible for deadlock detection
- message delay effect on model
possible false alarms
actions resulting in resource release
rollback or termination due to other reasons
- give up DS advantages?
- centralized detection algorithm
- controller initiates
- other sites send local RAGs to controller
- controller constructs global RAG:
local RAG +
inter-processor request with TS
- possible use of election algorithms
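A minimal sketch of what the controller does once the local graphs arrive. It assumes each site reports a wait-for graph as a dict mapping a process to the set of processes it waits for, and it leaves out the timestamps on inter-processor requests that guard against false alarms. Function names and the example graphs are hypothetical.

```python
def build_global_graph(local_graphs):
    """Union the wait-for edges reported by each site."""
    global_graph = {}
    for graph in local_graphs:
        for proc, waits_for in graph.items():
            global_graph.setdefault(proc, set()).update(waits_for)
    return global_graph


def has_cycle(graph):
    """True if the wait-for graph contains a cycle (depth-first search)."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            # a GREY successor is on the current path: cycle found
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


site_a = {"P1": {"P2"}, "P2": {"P3"}}
site_b = {"P3": {"P4"}, "P4": {"P2"}}
print(has_cycle(site_a), has_cycle(site_b))
print(has_cycle(build_global_graph([site_a, site_b])))
```

Neither local graph has a cycle, but the union does (P2, P3, P4), which is the case of a global deadlock with no local deadlock.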
- fully distributed deadlock detection
- each site: partial RAG
- deadlock: some site detects it
- local processes: individually
- external processes: Pex
- possible deadlocks, but not always (more false alarms)
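A rough sketch of how a single site might read its partial graph, with the extra node Pex standing for all external processes. It covers only the local check, not the message a site would then send to another site to confirm or dismiss the suspicion. The graph format (dict of process to set of processes waited for) and the function names are assumptions.

```python
PEX = "Pex"


def has_cycle(graph, skip=None):
    """True if the wait-for graph has a cycle, ignoring the node `skip`."""

    def reaches(start, target, seen):
        for nxt in graph.get(start, ()):
            if nxt == skip or nxt in seen:
                continue
            if nxt == target:
                return True
            seen.add(nxt)
            if reaches(nxt, target, seen):
                return True
        return False

    return any(n != skip and reaches(n, n, set()) for n in graph)


def classify(local_graph):
    """Interpret one site's partial wait-for graph."""
    if has_cycle(local_graph, skip=PEX):
        return "deadlock"            # cycle among local processes only
    if has_cycle(local_graph):
        return "possible deadlock"   # cycle passes through Pex: may be a false alarm
    return "no deadlock seen"


# P3 waits for some external process, and some external process waits for P2.
site = {"P1": {"P2"}, "P2": {"P3"}, "P3": {PEX}, PEX: {"P2"}}
print(classify(site))
```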
- much harder problem in distributed environment
- harder problem, but
- not as frequently occurring
- not as much reliance on remote resources
Prepared by Jeff Tian
(tian@engr.smu.edu).
Last update April 24, 2003.