In my last post, on the same topic above, I outlined the steps, possible issues and how to overcome them when setting up Hive tables, Flume and getting to query the data through Hive.
I realized that there wasn’t any explanation on the structure of the Flume configuration file. Understanding how the Flume configuration file is structured enabled me to quickly configure and understand the basics of setting up Flume and I think this would be useful to anyone who wants to start working with Flume.
As per my experience on new tools (at least new to me), the documentation took me some time to understand and digest and I had to trawl through the web for examples for me to learn.
I will give a quick and brief description of the Flume configuration file so that at the least it will make sense for you when you work on this tutorial / exercise, this will be done section by section as follows.
Flume Sources, Channels and Sinks
A typical way to view how data will be ingested is to think in terms of Sources, Channels and Sinks. Sources define well sources of data. Sinks define the target by which the data will be persisted to. Channels refer to the way the data will be transferred between sources and sinks.
The first thing to do is to define the Sources, Channels and Sinks for a Flume connection.
# Naming the components on the current agent. TwitterAgent.sources = Twitter TwitterAgent.channels = MemChannel TwitterAgent.sinks = HDFS
Typically, the above section details the source, channel and sink for Flume to operate on. Note that Flume can handle multiple sources, channels and sinks within a single configuration file. However, for this exercise / tutorial we will keep it simple with only 1 source, 1 channel and 1 sink.
Here, the string “TwitterAgent” refers to the name of the Flume agent that the properties of “sources”, “channels” and “sinks” belong to.
In essence, by specifying “TwitterAgent” as the string before “sources”, we are configuring the sources property of the Flume agent named “TwitterAgent”. This is important as we will refer to this agent name when running Flume.
Here we have specified the sources to be “Twitter”, using “MemChannel” as the channel and “HDFS” as the sink. The reason why I have put them in quotes will be revealed later. You could have also used any other string – e.g. “Twtr”, “M1”, “S1” etc … to refer to the sources, channels and sinks, but I chose this naming convention to ensure readability.
Generally, naming the components in Flume takes on the following convention:
<agent_name>.sources = <source_name> <agent_name>.channels = <channel_name> <agent_name>.sinks = <sink_name>
Describing the Source
A source will have a list of properties. The property “type” is common to every source, and it is used to specify the type of the source we are using. Examples of source types are: HTTP, Twitter, UDP etc.
# Describing/Configuring the source TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource TwitterAgent.sources.Twitter.consumerKey = <consumer key> TwitterAgent.sources.Twitter.consumerSecret = <consumer secret> TwitterAgent.sources.Twitter.accessToken = <access token> TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret> TwitterAgent.sources.Twitter.keywords = @realDonaldTrump, @HillaryClinton, @SenSanders, @BernieSanders, @tedcruz, #election2016, #hillaryclinton, #hillary, #hillary2016, #Hillary2016, #donaldtrump, #trump, #dumptrump, #pooptrump, #turdtrump, #sanders, #tedcruz, #feelthebern, #dontfeelthebern, #bernie2016, #trump2016, #whybother2016, #trumptrain, #notrump, #whichhillary, #voteforbernie, #sandersonly, #americafortrump, #berniecrats, #berniestrong, #berniesanders2016, #imwithher, #killary, #stepdownhillary, #stophillary, #vote2016
Some sources such as Twitter may require specific Java libraries such as:
You will need to find out if your desired source is supported and if there are any implementation specific libraries required.
Other than the property “type”, different sources may have other required properties of a particular source. For example, if you use the Twitter source as described above, you will be required to include the consumerKey, consumerSecret, accessToken, accessTokenSecret properties. If you had chosen HTTP another different set of properties would then be required to be entered.
You will need to find out for each source type, what are the specific required properties you need to provide in the configuration. This will probably be the most challenging thing during the creation of the Flume config script.
In the above example, the keywords to filter the tweets are defined in the property “keywords”. This means that Flume will capture tweets with any of the above listed keywords.
Specifying the source properties in Flume takes on the following convention:
<agent_name>.sources.<source_name>.type = <value> <agent_name>.sources.<source_name>.<property2> = <value> <agent_name>.sources.<source_name>.<property3> = <value>
Describing the Sink
Similar to the source, each sink (in the case of multiple sinks) will have a separate list of properties. The property “type” is common to every source, and it is used to specify the type of the source we are using. Other than the property “type”, different sinks may have other required properties of a particular sink, as shown below.
# Describing/Configuring the sink TwitterAgent.sinks.HDFS.type = hdfs TwitterAgent.sinks.HDFS.hdfs.path = /twitteranalytics/incremental TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text TwitterAgent.sinks.HDFS.hdfs.filePrefix = twitter- TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0 TwitterAgent.sinks.HDFS.hdfs.rollSize = 524288 TwitterAgent.sinks.HDFS.hdfs.rollCount = 0 TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0 TwitterAgent.sinks.HDFS.hdfs.batchSize = 100 TwitterAgent.sinks.HDFS.hdfs.threadsPoolSize = 2 TwitterAgent.sinks.HDFS.hdfs.round = true TwitterAgent.sinks.HDFS.hdfs.roundUnit = hour
In the example above, the sink type is “hdfs” and the properties belonging to the sink type HDFS are shown above. You can refer to this website for the list of properties and their definitions: http://hadooptutorial.info/flume-data-collection-into-hdfs/
Specifying the sink properties in Flume takes on the following convention:
<agent_name>.sinks.<source_name>.type = <value> <agent_name>.sinks.<source_name>.<property2> = <value> <agent_name>.sinks.<source_name>.<property3> = <value>
Note that for HDFS, the property format is hdfs.<property>
Describing the Channel
The channel configuration is similar to that of sources and sinks and is as shown below:
# Describing/Configuring the channel TwitterAgent.channels.MemChannel.type = memory TwitterAgent.channels.MemChannel.capacity = 10000 TwitterAgent.channels.MemChannel.transactionCapacity = 100
You can refer to this website for the list of properties and their definitions: http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html#flume-channels
Finally, to connect the source, channel and sinks together the following needs to be declared in the configuration file. The following configuration specifies that the Twitter source and HDFS sink are both using the same channel “MemChannel”. This effectively binds the source and the sink.
# Binding the source and sink to the channel TwitterAgent.sources.Twitter.channels = MemChannel TwitterAgent.sinks.HDFS.channel = MemChannel
Starting Flume Agent
When starting Flume, you will now specify the specific agent and its relevant configuration to be started as follows:
flume-ng agent -f flume_process_twitter.conf Dflume.root.logger=DEBUG,console -n TwitterAgent
As you will notice, the –name, -n <name> command line parameter specifies the specific agent to execute. Since in this case, the agent name in the configuration file is TwitterAgent, hence all the configuration settings beginning with TwitterAgent will be applied.
Note that this allows us to specify multiple different agents within the same single configuration file.
We have covered a brief overview and walkthrough on configuring Flume. I hope this will help you understand the structure of the Flume configuration file and how to set it up for your next big data project!
Here are links to some of the websites and how-tos which I have found helpful in understanding how to setup Flume.
In the next post, I will go through in detail how to use Spark to execute the sentiment analysis.