Flume: A Distributed Log Collection and Processing System
Flume, originally provided by Cloudera, is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting large volumes of log data. Flume supports customizable data senders within the logging system for collecting data; it can also apply simple processing to the data and write it to a variety of (customizable) data receivers.
Flume's initial releases are collectively known as Flume OG (original generation) and belonged to Cloudera. As Flume's functionality expanded, the drawbacks of Flume OG became apparent: a bloated codebase, poorly designed core components, and non-standard core configuration. In the last OG release, 0.94.0, unstable log delivery was especially severe. To address these problems, on October 22, 2011 Cloudera completed FLUME-728, a milestone overhaul of Flume that refactored the core components, the core configuration, and the code architecture; the refactored versions are collectively known as Flume NG (next generation). Another reason for the change was moving Flume under the Apache umbrella: Cloudera Flume was renamed Apache Flume. The IBM developerWorks article "Flume NG: the first revolution in Flume's history" describes the transition from Flume OG to Flume NG from the perspective of the basic components and the user experience.
Core concepts (a minimal agent configuration follows this list):
- Event: A byte payload with optional string headers that represents the unit of data that Flume can transport from its point of origination to its final destination.
- Flow: Movement of events from the point of origin to their final destination is considered a data flow, or simply flow. This is not a rigorous definition and is used only at a high level for description purposes.
- Client: An interface implementation that operates at the point of origin of events and delivers them to a Flume agent. Clients typically operate in the process space of the application they are consuming data from. For example, Flume Log4j Appender is a client.
- Agent: An independent process that hosts Flume components such as sources, channels and sinks, and thus has the ability to receive, store and forward events to their next-hop destination.
- Source: An interface implementation that can consume events delivered to it via a specific mechanism. For example, an Avro source is a source implementation that can be used to receive Avro events from clients or other agents in the flow. When a source receives an event, it hands it over to one or more channels.
- Channel: A transient store for events, where events are delivered to the channel via sources operating within the agent. An event put in a channel stays in that channel until a sink removes it for further transport. An example of a channel is the JDBC channel, which uses a file-system-backed embedded database to persist events until they are removed by a sink. Channels play an important role in ensuring the durability of flows.
- Sink: An interface implementation that can remove events from a channel and transmit them to the next agent in the flow, or to the event's final destination. Sinks that transmit the event to its final destination are also known as terminal sinks. The Flume HDFS sink is an example of a terminal sink, whereas the Flume Avro sink is an example of a regular sink that can transmit messages to other agents that are running an Avro source.
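To make these concepts concrete, here is a minimal single-agent sketch in Flume's properties-file configuration format. The agent name a1 and the component names r1/c1/k1 are arbitrary placeholders; netcat, memory, and logger are all built-in component types.

```properties
# a1: one source, one channel, one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# netcat source: listens on a TCP port and turns each line into an event
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# memory channel: fast, but events are lost if the agent dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# logger sink: writes events to the agent's log, useful for testing
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

Started with something like `flume-ng agent --conf conf --conf-file example.conf --name a1`, this agent reads lines sent to port 44444 and logs each one as an event.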
Flume NG's advantages:
- Far greater extensibility, able to satisfy much richer (even exotic) log-collection requirements.
- A genuinely distributed system with no single point of failure, and a greatly simplified overall design.
- An elegant implementation: each agent's configuration stands alone, which pairs well with automated operations.
- Very detailed configuration documentation on the official site: sources, channels, sinks, sink processors, interceptors.
High reliability:
Within a single agent, Flume uses transaction-based delivery to guarantee event reliability: both the source's handoff to the channel and the sink's removal from it are wrapped in transactions, and an event is removed from the channel only after it has been handled. This is Flume's point-to-point reliability mechanism.
Across multi-hop flows, the sink of the previous agent and the source of the next agent likewise use their own transactions to guarantee the reliability of the data.
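As a sketch of such a multi-hop flow, the pair of agent configurations below chains two agents over Avro. The host name collector.example.com, the port 4141, the log path, and the HDFS path are placeholder assumptions; exec, avro, file, and hdfs are standard component types.

```properties
# a1: upstream agent, runs where the logs originate
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# exec source tails a file (note: offers no delivery guarantee if the agent dies mid-read)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
# file channel persists events to disk, so a restart does not lose them
a1.channels.c1.type = file
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = collector.example.com
a1.sinks.k1.port = 4141
a1.sinks.k1.channel = c1

# a2: downstream agent, runs on collector.example.com
a2.sources = r1
a2.channels = c1
a2.sinks = k1
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sources.r1.channels = c1
a2.channels.c1.type = file
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.channel = c1
```

The Avro sink's send and the Avro source's put into the downstream channel are each transactional, so an event leaves a1's channel only once a2 has accepted it.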
Availability:
Built-in load balancing support and built-in failover support (both covered under Sink Processor below).
Directory monitoring:
The Spooling Directory Source (SpoolSource) watches a configured directory for newly added files and reads the data out of them, giving near-real-time collection. Two caveats: (1) a file copied into the spool directory must not be opened and edited afterwards; (2) the spool directory must not contain subdirectories. In practice it can be combined with log4j: set log4j to roll its files once per minute and copy the rolled files into the monitored spool directory. log4j has a TimeRolling plugin that can move the rolled files into the spool directory, which achieves essentially real-time monitoring. After Flume has finished transferring a file, it renames it with a .COMPLETED suffix (the suffix can be set in the configuration file).
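A minimal sketch of this source, assuming logs are dropped into /var/log/spool (a placeholder path); spooldir and fileSuffix are the actual Flume property names:

```properties
a1.sources = r1
a1.channels = c1
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /var/log/spool
# suffix appended once a file has been fully ingested (.COMPLETED is the default)
a1.sources.r1.fileSuffix = .COMPLETED
a1.sources.r1.channels = c1
```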
Core concept: Interceptor
A chain of interceptors attached to a source decorates and filters events, in a preset order, wherever needed.
Built-in interceptors can add event headers such as a timestamp, the host name, or static tags.
Custom interceptors can introspect the event payload (read the raw log) and create specific headers where needed.
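A sketch of an interceptor chain on a source: the names ts, host, and env are arbitrary, while timestamp, host, and static are the built-in interceptor type aliases.

```properties
# interceptors run in the declared order: ts, then host, then env
a1.sources.r1.interceptors = ts host env
# adds a "timestamp" header with the event's processing time
a1.sources.r1.interceptors.ts.type = timestamp
# adds a "host" header with the agent's host name or IP
a1.sources.r1.interceptors.host.type = host
# adds a fixed key/value header to every event
a1.sources.r1.interceptors.env.type = static
a1.sources.r1.interceptors.env.key = datacenter
a1.sources.r1.interceptors.env.value = dc1
```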
Channel Selector
A channel selector allows a source to select one or more of its channels, out of all attached channels, based on preset criteria.
Built-in channel selectors:
- Replicating: the event is copied to every associated channel
- Multiplexing: the event is routed to specific channels based on a header
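A sketch of a multiplexing selector, routing on a hypothetical state header (the header name and mapping values are placeholders):

```properties
a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.type = multiplexing
# route on the value of the "state" header
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2
# events whose header matches no mapping go to the default channel
a1.sources.r1.selector.default = c3
```

Replicating is the default selector type, so it needs no explicit configuration.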
Sink Processor
Multiple sinks can form a sink group, and a sink processor is responsible for activating one sink from the specified group.
A sink processor can load-balance across all the sinks in the group, or fail over to another sink when one fails.
This is how Flume implements load balancing and failover.
Built-in sink processors:
- Load Balancing Sink Processor: uses RANDOM, ROUND_ROBIN, or a custom selection algorithm
- Failover Sink Processor
- Default Sink Processor (single sink)
All sinks obtain events from the channel by polling; this polling is driven by a SinkRunner. The sink processor acts as a proxy for the sinks in its group.
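A sketch of both processor types on a two-sink group; k1 and k2 are assumed to be already-defined sinks, and the priority and penalty values are illustrative:

```properties
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# load-balancing processor: spread events across k1 and k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin
# temporarily blacklist a failing sink instead of retrying it immediately
a1.sinkgroups.g1.processor.backoff = true

# failover variant (use instead of the block above);
# higher priority wins, so k2 is used only while k1 is down
# a1.sinkgroups.g1.processor.type = failover
# a1.sinkgroups.g1.processor.priority.k1 = 10
# a1.sinkgroups.g1.processor.priority.k2 = 5
# a1.sinkgroups.g1.processor.maxpenalty = 10000
```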