
Paper Index


1 Distributed System

1.1 Google

1.1.1 The Google File System (SOSP03)

EN, CN, PDF, Reading Notes.

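GFS is one of the best-known papers in distributed storage. HDFS and Alibaba Cloud's Pangu storage system were both built with this paper as a reference. The core ideas are: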

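  • A highly available Master manages the file system metadata, all of which stays resident in memory. The original paper keeps a single Master available through a replicated operation log and shadow masters; later systems modeled on GFS typically use a Paxos/Raft consensus group instead;
  • ChunkServers provide the single-node storage engine; the design is optimized for appends (derived systems often support append-only writes exclusively), and IO data never passes through the Master;
  • An SDK exposes a POSIX-like file system interface and encapsulates the Master/ChunkServer interaction.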
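The division of labor above can be illustrated with a toy, single-process model. This is only a sketch under stated assumptions: the class names `Master`, `ChunkServer`, and `Client`, the round-robin placement, and the 8-byte `CHUNK_SIZE` are all hypothetical (GFS itself uses 64 MB chunks), and replication, leases, and failure handling are left out. For simplicity every append allocates fresh chunks rather than filling the last one.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # bytes; tiny on purpose so the example is easy to follow


@dataclass
class ChunkServer:
    """Holds chunk data; the only write operation is append."""
    chunks: dict = field(default_factory=dict)  # chunk_id -> bytearray

    def append(self, chunk_id: int, data: bytes) -> None:
        self.chunks.setdefault(chunk_id, bytearray()).extend(data)

    def read(self, chunk_id: int) -> bytes:
        return bytes(self.chunks.get(chunk_id, b""))


@dataclass
class Master:
    """Keeps all metadata in memory: path -> [(chunk_id, server_index)]."""
    num_servers: int
    files: dict = field(default_factory=dict)
    next_chunk_id: int = 0

    def allocate_chunk(self, path: str):
        chunk_id = self.next_chunk_id
        self.next_chunk_id += 1
        # Hypothetical placement policy: round-robin over servers.
        location = (chunk_id, chunk_id % self.num_servers)
        self.files.setdefault(path, []).append(location)
        return location

    def lookup(self, path: str):
        return list(self.files.get(path, []))


class Client:
    """SDK facade: metadata goes to the Master, data goes to ChunkServers."""

    def __init__(self, master: Master, servers: list):
        self.master = master
        self.servers = servers

    def append(self, path: str, data: bytes) -> None:
        for start in range(0, len(data), CHUNK_SIZE):
            piece = data[start:start + CHUNK_SIZE]
            chunk_id, server_idx = self.master.allocate_chunk(path)
            # File data is sent straight to the ChunkServer, never via the Master.
            self.servers[server_idx].append(chunk_id, piece)

    def read(self, path: str) -> bytes:
        return b"".join(
            self.servers[idx].read(cid) for cid, idx in self.master.lookup(path)
        )
```

The Master only ever sees paths and chunk locations, which is why its memory footprint grows with the number of files and chunks rather than with the volume of data.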

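The Master is a hotspot in GFS: cluster-wide QPS tops out in the tens of thousands, so GFS is a poor fit for storing massive numbers of small files. That need is better served by a separate file system built on top of GFS, with GFS used for bootstrap.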

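To support very large clusters, a single GFS cluster can run multiple groups of Masters; this is called Federation. HDFS has an article describing this approach: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html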

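Inside Google, GFS evolved toward distributed metadata to support larger clusters. The internal code name is Colossus, but no detailed paper has been published. This link can be found online: https://www.systutorials.com/colossus-successor-to-google-file-system-gfs/ Some excerpts follow: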

  • "We also ended up doing what we call a "multi-cell" approach, which basically made it possible to put multiple GFS masters on top of a pool of chunkservers."
  • "We also have something we called Name Spaces, which are just a very static way of partitioning a namespace that people can use to hide all of this from the actual application." … "a namespace file describes"
  • "The distributed master certainly allows you to grow file counts, in line with the number of machines you’re willing to throw at it." … "Our distributed master system that will provide for 1-MB files is essentially a whole new design. That way, we can aim for something on the order of 100 million files per master. You can also have hundreds of masters."
  • BigTable "as one of the major adaptations made along the way to help keep GFS viable in the face of rapid and widespread change."

1.1.2 MapReduce: Simplified Data Processing on Large Clusters

PDF.

1.1.3 Bigtable: A Distributed Storage System for Structured Data

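PDF, Reading Notes. Master election via Chubby, the three-level Tablet location hierarchy, the client-side cache update mechanism, and the LSM storage engine are all valuable references for implementing a storage system.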
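As a reminder of what the LSM engine mentioned above does, here is a minimal in-memory sketch. It is not Bigtable's implementation: the class name `TinyLSM`, the `memtable_limit` parameter, and the use of plain Python lists as stand-ins for on-disk SSTables are assumptions made for illustration, and deletes, write-ahead logging, and Bloom filters are omitted.

```python
import bisect


class TinyLSM:
    """Writes go to a memtable; full memtables are frozen into sorted runs."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable = {}
        self.sstables = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        """Freeze the memtable into an immutable sorted run."""
        if self.memtable:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable first, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            # (key,) sorts just before any (key, value) pair with the same key.
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None

    def compact(self) -> None:
        """Merge all runs into one, letting newer values overwrite older ones."""
        merged = {}
        for table in self.sstables:  # oldest first, so later updates win
            merged.update(table)
        self.sstables = [sorted(merged.items())] if merged else []
```

Reads may have to consult several runs, which is the cost paid for turning random writes into sequential ones; compaction bounds that cost by merging runs.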

1.1.4 The Chubby lock service for loosely-coupled distributed systems

PDF.

1.1.5 Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

PDF.

1.1.6 Percolator: Large-scale Incremental Processing Using Distributed Transactions and Notifications

PDF.

1.1.7 Megastore: Providing Scalable, Highly Available Storage for Interactive Services

PDF.

1.1.8 Spanner: Google's Globally-Distributed Database

PDF.

1.1.9 F1: A Distributed SQL Database That Scales

PDF.

1.1.10 Goods: Organizing Google's Datasets

PDF.

1.1.11 Colossus: Next generation of GFS

No paper has been published; the material below is collected from various sources.

Google File System II: Dawn of the Multiplying Master Node
https://www.theregister.com/2009/08/12/google_file_system_part_deux/

1.2 Microsoft

1.2.1 Windows Azure Storage

Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency. PDF.

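The Windows Azure Storage (WAS) paper describes Microsoft's cloud storage system, which offers blob (file), table, and queue storage products. Blob is similar to Alibaba Cloud OSS, Table to Alibaba Cloud OTS/TableStore, and Queue to Alibaba Cloud MessageQueue.

WAS provides a URL access interface: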

http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
<service> is blob, table or queue.
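The URL layout can be made concrete with a small builder and parser. This is a sketch, not part of any Azure SDK: the function names `build_was_url` and `parse_was_url` are hypothetical, and treating ObjectName as optional is an assumption of this example.

```python
from urllib.parse import urlsplit

SERVICES = {"blob", "table", "queue"}


def build_was_url(account, service, partition, obj=None, secure=True):
    """Assemble a WAS-style URL from its components."""
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    scheme = "https" if secure else "http"
    path = partition if obj is None else f"{partition}/{obj}"
    return f"{scheme}://{account}.{service}.core.windows.net/{path}"


def parse_was_url(url):
    """Split a WAS-style URL into (account, service, partition, object)."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    if (
        len(labels) != 5
        or labels[2:] != ["core", "windows", "net"]
        or labels[1] not in SERVICES
    ):
        raise ValueError(f"not a WAS-style URL: {url!r}")
    account, service = labels[0], labels[1]
    # Everything up to the first "/" is the partition; the rest is the object.
    partition, _, obj = parts.path.lstrip("/").partition("/")
    return account, service, partition, obj or None
```

For example, `build_was_url("myaccount", "blob", "photos", "cat.png")` yields `https://myaccount.blob.core.windows.net/photos/cat.png`, and `parse_was_url` recovers the four components from it.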

The top-level WAS architecture is as follows:

ds-microsoft-was.png

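  • Stream Layer: a distributed file system. The SM (StreamManager) maintains metadata and achieves high availability through Paxos; ENs (ExtentNodes) provide the append-only data service. Similar to GFS/HDFS/Pangu;
  • Partition Layer: an LSM-Tree structure. It provides a scalable object namespace, gives objects transaction ordering and strong consistency, stores data in the Stream Layer, and caches frequently accessed data;
  • Front-End (FE) Layer: authenticates requests, then routes them to a PartitionServer. The FE caches the PartitionMap, streams large objects directly from the Stream Layer, and caches frequently accessed data.

The Stream Layer provides synchronous replication; the Partition Layer provides asynchronous replication.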
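The difference between the two replication modes comes down to when the writer is acknowledged. The sketch below contrasts them in a single process; the class names `Replica`, `SyncReplicatedStream`, and `AsyncReplicator` are hypothetical, and real systems add networking, failure handling, and ordering guarantees that are omitted here.

```python
from collections import deque


class Replica:
    """A trivially simple replica: an in-memory log of records."""

    def __init__(self):
        self.log = []


class SyncReplicatedStream:
    """Synchronous replication: acknowledge only after all replicas have the record."""

    def __init__(self, replicas):
        self.replicas = replicas

    def append(self, record) -> int:
        for replica in self.replicas:
            replica.log.append(record)
        # Returning the offset stands in for the acknowledgement to the caller.
        return len(self.replicas[0].log) - 1


class AsyncReplicator:
    """Asynchronous replication: acknowledge after the primary, ship later."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = deque()

    def write(self, record) -> None:
        self.primary.log.append(record)
        self.pending.append(record)  # the secondary lags until drain() runs

    def drain(self) -> int:
        """Ship queued records to the secondary; returns how many were sent."""
        shipped = 0
        while self.pending:
            self.secondary.log.append(self.pending.popleft())
            shipped += 1
        return shipped
```

With the synchronous stream every acknowledged record is already on all replicas, whereas with the asynchronous replicator the secondary can be behind by whatever is still in `pending`.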

1.2.2 Pelican: A building block for exascale cold data storage

Pelican: A building block for exascale cold data storage.

1.3 Tencent

1.3.1 PaxosStore: High-availability Storage Made Practical in WeChat

1.4 Ceph

1.4.1 Ceph: A Scalable, High-Performance Distributed File System

PDF.

1.4.2 CRUSH

PDF.

1.4.3 File Systems Unfit as Distributed Storage Backends: Lessons from 10 Years of Ceph Evolution

PDF.

1.5 HDFS

1.5.1 The Hadoop Distributed File System

PDF.

1.5.2 HDFS: Balancing Portability and Performance

PDF.

1.6 Consensus Algorithms

1.6.1 Lamport The Part-Time Parliament

PDF.

1.6.2 Lamport The Byzantine Generals Problem

PDF.

1.6.3 Lampson How to Build a Highly Available System Using Consensus

PDF.

1.6.4 Revisiting the Paxos Algorithm

PDF.

1.6.5 Paxos Made Simple

PDF.

1.6.6 Cheap Paxos

PDF.

1.6.7 Fast Paxos

PDF.

1.6.8 Paxos Made Live - An Engineering Perspective

PDF.

1.6.9 Raft - In Search of an Understandable Consensus Algorithm

PDF.

1.6.10 Consensus: Bridging Theory and Practice

PDF.

1.6.11 Viewstamped Replication

PDF.

1.7 Transactions

1.7.1 Two Phase Commit

1.7.2 Nonblocking Commit Protocols

PDF.

1.7.3 Consensus on Transaction Commit

PDF.

1.7.4 Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus

1.8 Distributed Fundamentals

1.8.1 Dijkstra Solution of a Problem in Concurrent Programming Control

PDF.

1.8.2 Dijkstra Self-stabilizing Systems in Spite of Distributed Control

PDF.

1.8.3 Jim Gray Why Do Computers Stop and What Can Be Done About It?

PDF.

1.8.4 A New Solution of Dijkstra's Concurrent Programming Problem

PDF.

1.8.5 Lamport Time, Clocks, and the Ordering of Events in a Distributed System

PDF.

1.8.6 Distributed Snapshots: Determining Global States of Distributed Systems

PDF.

1.8.7 Virtual Time and Global States of Distributed Systems

PDF.

1.8.8 Impossibility of Distributed Consensus with One Faulty Process

PDF.

Lamport is the recipient of the 2013 Turing Award; here is the link.

2 Unsorted

3 Reference Websites