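The Log: What every software engineer should know about real-time data's unifying abstraction (a must-read classic; English original)
The log solves two core problems: ordering changes, and distributing data (shipping the sequence of changes through the log to other services, slaves, or replicas).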
You can reduce the problem of making multiple machines all do the same thing to the problem of implementing a distributed consistent log to feed these processes input. The purpose of the log here is to squeeze all the non-determinism out of the input stream to ensure that each replica processing this input stays in sync.
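A minimal sketch of that idea, assuming a toy in-memory log of `(op, key, value)` tuples. The `Replica` class and the operation names are hypothetical, chosen only for illustration. Because each replica applies the same entries in the same order and the apply step is deterministic, a lagging replica converges to the same state once it catches up.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Deterministic state machine: state depends only on the log entries applied."""
    state: dict = field(default_factory=dict)
    applied: int = 0  # index of the next log entry to apply

    def catch_up(self, log):
        # Apply every entry not yet seen, strictly in log order.
        for op, key, value in log[self.applied:]:
            if op == "set":
                self.state[key] = value
            elif op == "delete":
                self.state.pop(key, None)
        self.applied = len(log)


# The shared, totally ordered log is the only input to every replica.
log = [("set", "x", 1), ("set", "y", 2), ("set", "x", 3), ("delete", "y", None)]

a, b = Replica(), Replica()
a.catch_up(log)      # a reads the whole log at once
b.catch_up(log[:2])  # b lags behind after two entries...
b.catch_up(log)      # ...then catches up from where it stopped
```

Both replicas end in the same state even though they read the log at different times, which is the property the quoted passage relies on.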
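Data integration - making all of an organization's data easily available in all of its storage and processing systems.
Real-time data processing - computing derived data streams.
Distributed system design - how a log-centric design can simplify practical systems.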
The log also acts as a buffer that makes data production asynchronous from data consumption. This is important for a lot of reasons, but particularly when there are multiple subscribers that may consume at different rates. This means a subscribing system can crash or go down for maintenance and catch up when it comes back: the subscriber consumes at a pace it controls. A batch system such as Hadoop or a data warehouse may consume only hourly or daily, whereas a real-time query system may need to be up-to-the-second. Neither the originating data source nor the log has knowledge of the various data destination systems, so consumer systems can be added and removed with no change in the pipeline.
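The buffering behaviour can be sketched with a toy in-memory log in which each subscriber keeps its own read offset. The `Log` and `Subscriber` classes below are hypothetical stand-ins, not any real system's API. One subscriber polls after every event, the other only once at the end, and neither the producer nor the log knows about either of them.

```python
class Log:
    """Append-only sequence of records; readers address it by offset."""

    def __init__(self):
        self.entries = []

    def append(self, record):
        self.entries.append(record)
        return len(self.entries) - 1  # offset of the new record

    def read(self, offset, max_records):
        return self.entries[offset:offset + max_records]


class Subscriber:
    """Tracks its own position, so it consumes at a pace it controls."""

    def __init__(self, log):
        self.log = log
        self.offset = 0
        self.seen = []

    def poll(self, max_records):
        batch = self.log.read(self.offset, max_records)
        self.seen.extend(batch)
        self.offset += len(batch)
        return batch


log = Log()
realtime = Subscriber(log)   # polls after every event
batch_job = Subscriber(log)  # polls once, long after the events were produced

for i in range(6):
    log.append(f"event-{i}")
    realtime.poll(max_records=1)

batch_job.poll(max_records=100)  # catches up in a single read
```

Adding a third subscriber would require no change to the producer loop or to `Log`, since the only per-consumer state is the offset held by the subscriber itself.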
A log, like a filesystem, is easy to optimize for linear read and write patterns. The log can group small reads and writes together into larger, high-throughput operations. Kafka pursues this optimization aggressively. Batching occurs from client to server when sending data, in writes to disk, in replication between servers, in data transfer to consumers, and in acknowledging committed data.
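As a rough illustration of the batching idea (not Kafka's actual client or broker code), the sketch below groups small writes into larger flushes. `BatchingWriter`, its `batch_size` parameter, and the list used as a sink are all assumptions made for this example; the list stands in for a disk or network write.

```python
class BatchingWriter:
    """Buffers small writes and hands them to the sink in larger groups."""

    def __init__(self, sink, batch_size):
        self.sink = sink            # a list standing in for disk or network
        self.batch_size = batch_size
        self.pending = []
        self.flushes = 0            # number of (expensive) sink operations

    def write(self, record):
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.sink.extend(self.pending)  # one large sequential write
            self.pending = []
            self.flushes += 1


sink = []
writer = BatchingWriter(sink, batch_size=4)
for i in range(10):
    writer.write(i)
writer.flush()  # push out the final partial batch
```

Ten records reach the sink in three operations rather than ten, and their order is unchanged. A real implementation would typically also flush on a timer so that a slow trickle of records is not delayed indefinitely.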
The real driver for the processing model is the method of data collection. Data which is collected in batch is naturally processed in batch. When data is collected continuously, it is naturally processed continuously.
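Logically, a system can be split into two parts: a log layer and a serving layer. The log layer captures state changes in sequential order. The serving layer builds whatever index structures the service needs to answer its queries; for example, a key-value service would build the data into a B-tree or SSTable index, which makes queries more efficient.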
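The split can be sketched as follows, assuming the log is a plain list of `(key, value)` writes and using a sorted key list as a simple stand-in for a B-tree or SSTable index. `ServingNode` and its methods are hypothetical names for illustration only.

```python
import bisect


class ServingNode:
    """Serving layer: builds a queryable index from the log layer's writes."""

    def __init__(self):
        self.keys = []     # sorted keys, a stand-in for a B-tree/SSTable index
        self.values = {}   # latest value per key
        self.applied = 0   # how far into the log this node has indexed

    def apply_from(self, log):
        # Consume new log entries in order and update the index.
        for key, value in log[self.applied:]:
            if key not in self.values:
                bisect.insort(self.keys, key)
            self.values[key] = value
        self.applied = len(log)

    def get(self, key):
        return self.values.get(key)

    def scan(self, lo, hi):
        # Range query over the sorted index, inclusive on both ends.
        i = bisect.bisect_left(self.keys, lo)
        j = bisect.bisect_right(self.keys, hi)
        return [(k, self.values[k]) for k in self.keys[i:j]]


# Log layer: state changes captured in the order they happened.
log = [("b", 1), ("a", 2), ("c", 3), ("b", 4)]

node = ServingNode()
node.apply_from(log)
```

Here `node.get("b")` returns the latest write for `b`, and `node.scan("a", "b")` walks the sorted index. A second serving node with a different index structure could be fed from the same log without changing how writes are captured.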