flume info

Introduction to Flume

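What is Flume NG?
Flume NG is a distributed, highly available, and reliable system for collecting large volumes of data from many different sources, moving it, and storing it in a centralized data store.
It is lightweight and simple to configure, suits a wide range of log-collection scenarios, supports failover and load balancing, and ships with a rich set of components.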

Flume flow:
Events(Client => Agent(source => channel => sink))

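Core Flume concepts:
1. Event: a unit of data with an optional set of headers.
2. Flow: an abstraction of the movement of Events from their point of origin to their destination.
3. Client: operates on Events at the point of origin and sends them to a Flume Agent.
4. Agent: an independent Flume process that hosts the Source, Channel, and Sink components.
5. Source: consumes the Events delivered to it and puts them into one or more Channels.
6. Channel: a temporary store that holds the Events handed over by a Source until a Sink takes them.
7. Sink: reads and removes Events from a Channel and passes them to the next Agent in the flow pipeline (if there is one) or to the final destination.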

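How Flume works:
At its core, Flume collects data from a source and delivers it to a destination. To make delivery reliable, it buffers the data before sending it on, and deletes its buffered copy only after the data has actually reached the destination.
The basic unit of data that Flume transports is the Event. For a text file, an Event is typically one line (one record), and Events are also the basic unit handled in a transaction. An Event flows from Source to Channel to Sink; its body is a byte array, and it can carry headers. An Event represents the smallest complete unit of a data flow, coming from an external source and going to an external destination.
The Agent is the core of a running Flume deployment. It is a complete data collection tool built from three core components: source, channel, and sink. Through these components, Events flow from one place to another.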
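
The Event model and the source, channel, sink path described above can be illustrated with a small Python sketch. This is a toy in-memory model, not Flume's Java API: the names `Event`, `MemoryChannel`, `line_source`, and `list_sink` are made up for illustration, and the default capacity of 1000 is an arbitrary choice.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Toy stand-in for a Flume Event: a byte-array body plus optional headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


class MemoryChannel:
    """Hypothetical bounded in-memory buffer between a source and a sink."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event):
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take(self):
        # Returns the oldest event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


def line_source(text, channel):
    """Turn each line of text into one Event and put it on the channel."""
    for line in text.splitlines():
        channel.put(Event(body=line.encode("utf-8")))


def list_sink(channel, destination):
    """Drain the channel and append each decoded body to the destination list."""
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event.body.decode("utf-8"))
```

Calling `line_source("a\nb", channel)` followed by `list_sink(channel, out)` moves two events through the channel and leaves it empty, which mirrors the one-line-per-Event behaviour described for text files.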

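The three components of a Flume Agent:
1. source
Receives data sent by an external source. Different source types accept different data formats.
For example, the spooling directory source watches a specified directory for new files and reads their contents as soon as files appear there.
2. channel
A staging store that holds the output of the source until a sink consumes it.
Data stays in the channel until it has been written to the next agent's channel or to the final destination. If a sink fails to write, the transaction is rolled back and the events remain in the channel to be retried, so no data is lost at that step. (Durability across an agent restart depends on the channel type: a file channel persists events to disk, while a memory channel does not.)
3. sink
Consumes data from the channel and delivers it to an external repository or to the source of the next agent. For example, data can be written to HDFS or HBase.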
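
The rule that data leaves the channel only once the sink has delivered it can be sketched as a take, deliver, then commit-or-rollback cycle. The Python below is a simplified model under stated assumptions, not Flume's transaction API: `TransactionalChannel`, `drain`, and the `deliver` callback are hypothetical names, and the batch size of 100 simply echoes the `transactionCapacity` value used in the configuration examples.

```python
from collections import deque


class TransactionalChannel:
    """Toy channel whose events are removed only if delivery succeeds."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def __len__(self):
        return len(self._queue)

    def drain(self, deliver, batch_size=100):
        """Take up to batch_size events and hand them to deliver(batch).

        Returns True if deliver() succeeded (commit). If deliver() raises,
        the batch is put back at the head of the queue in its original
        order (rollback) and False is returned.
        """
        batch = []
        while self._queue and len(batch) < batch_size:
            batch.append(self._queue.popleft())
        try:
            deliver(batch)
        except Exception:
            # Rollback: extendleft reverses its input, so reverse first
            # to restore the original ordering.
            self._queue.extendleft(reversed(batch))
            return False
        return True
```

A sink that raises on a failed write leaves every event in the channel, so a later `drain` call can retry the same batch.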

  1. Flume configuration example 1

    penn@ubuntu:/mnt/app/flume$ cat conf/test.conf
    # name the components on this agent
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # source
    a1.sources.r1.type = netcat
    a1.sources.r1.bind = localhost
    a1.sources.r1.port = 4444

    # channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # sink
    a1.sinks.k1.type = logger

    # bind source and sink to the channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1

    penn@ubuntu:/mnt/app/flume$ ./bin/flume-ng agent --conf ./conf --conf-file ./conf/test.conf --name a1 -Dflume.root.logger=INFO,console
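
With the agent above running, one way to exercise the netcat source is to send it newline-terminated lines over TCP. The following is a minimal Python sketch, assuming the agent listens on localhost:4444 as configured and that the source acknowledges each line with a one-line reply (the netcat source's default behaviour is to answer `OK`). The function name `send_lines` and the timeout value are illustrative.

```python
import socket


def send_lines(lines, host="localhost", port=4444, timeout=5.0):
    """Send each line to a line-oriented TCP listener and collect the replies.

    Assumes the listener answers every line with exactly one reply line.
    """
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8", newline="\n")
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))
            replies.append(reader.readline().strip())
    return replies
```

For example, `send_lines(["hello flume"])` sends one event; with the logger sink configured above, the event should then show up in the agent's console output.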
  2. Flume configuration example 2

    1. A source instance can specify multiple channels, but a sink instance can specify only one channel. The format is as follows:
    # list the sources, sinks and channels for the agent
    <Agent>.sources = <Source>
    <Agent>.sinks = <Sink>
    <Agent>.channels = <Channel1> <Channel2>

    # set channel for source
    <Agent>.sources.<Source>.channels = <Channel1> <Channel2> ...

    # set channel for sink
    <Agent>.sinks.<Sink>.channel = <Channel1>

    2. Adding multiple flows in an agent
    <Agent>.sources = <Source1> <Source2>
    <Agent>.sinks = <Sink1> <Sink2>
    <Agent>.channels = <Channel1> <Channel2>

    3. In a replicating flow, the event is sent to all the configured channels. In a multiplexing flow, the event is sent only to a subset of qualifying channels. Once all required channels have consumed the events, the selector attempts to write to the optional channels.
    Note that if a header value has no required channels, the event is written to the default channels, and an attempt is made to write it to the optional channels for that header. Specifying optional channels still causes the event to be written to the default channels if no required channels are specified. If no channels are designated as default and there are no required channels, the selector attempts to write the events to the optional channels. Any failures are simply ignored in that case.

    # list the sources, sinks and channels in the agent
    agent_foo.sources = avro-AppSrv-source1
    agent_foo.sinks = hdfs-Cluster1-sink1 avro-forward-sink2
    agent_foo.channels = mem-channel-1 file-channel-2

    # set channels for source
    agent_foo.sources.avro-AppSrv-source1.channels = mem-channel-1 file-channel-2

    # set channel for sinks
    agent_foo.sinks.hdfs-Cluster1-sink1.channel = mem-channel-1
    agent_foo.sinks.avro-forward-sink2.channel = file-channel-2

    # channel selector configuration
    agent_foo.sources.avro-AppSrv-source1.selector.type = multiplexing
    agent_foo.sources.avro-AppSrv-source1.selector.header = State
    agent_foo.sources.avro-AppSrv-source1.selector.mapping.CA = mem-channel-1
    agent_foo.sources.avro-AppSrv-source1.selector.mapping.AZ = file-channel-2
    agent_foo.sources.avro-AppSrv-source1.selector.mapping.NY = mem-channel-1 file-channel-2
    agent_foo.sources.avro-AppSrv-source1.selector.optional.CA = mem-channel-1 file-channel-2
    agent_foo.sources.avro-AppSrv-source1.selector.default = mem-channel-1
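
To make the multiplexing rules concrete, here is a small Python model of how the `State` header value would be routed under the `agent_foo` configuration above. It is a sketch of the selection rule as described in the text, not Flume's implementation: `build_selector` and `select` are hypothetical names, and dropping channels from the optional list when they are already required for the same header value is an assumption about how the overlapping `optional.CA` entry is resolved.

```python
def build_selector(mapping, optional, default):
    """Return select(header_value) -> (required_channels, optional_channels).

    mapping:  header value -> list of required channel names
    optional: header value -> list of optional channel names
    default:  channels used when a header value has no required mapping
    """
    def select(header_value):
        # Fall back to the default channels when nothing is mapped.
        required = list(mapping.get(header_value) or default)
        # Assumption: a channel that is already required is not also
        # treated as optional for the same header value.
        extra = [c for c in optional.get(header_value, []) if c not in required]
        return required, extra

    return select


# Mirrors the selector.* lines of the agent_foo example.
select = build_selector(
    mapping={
        "CA": ["mem-channel-1"],
        "AZ": ["file-channel-2"],
        "NY": ["mem-channel-1", "file-channel-2"],
    },
    optional={"CA": ["mem-channel-1", "file-channel-2"]},
    default=["mem-channel-1"],
)
```

Under this model, an event with `State=CA` must reach `mem-channel-1` and is written to `file-channel-2` on a best-effort basis, while an event with an unmapped or missing `State` header goes to the default `mem-channel-1` only.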