Simple Twitter Sentiment Analytics Using Apache Flume and Spark – Part 2

In my last post on this topic, I outlined the steps for setting up Hive tables and Flume, the issues you may encounter along the way, and how to overcome them to get to the point of querying the data through Hive.

I realized, however, that I never explained the structure of the Flume configuration file. Understanding this structure is what enabled me to quickly configure Flume and grasp its basics, and I think it would be useful to anyone starting out with Flume.

In my experience with new tools (at least, new to me), the documentation took some time to understand and digest, and I had to trawl the web for examples to learn from.

Below is a quick, section-by-section description of the Flume configuration file, so that it will at least make sense to you as you work through this tutorial.

Let's start!

Flume Sources, Channels and Sinks

A typical way to think about how data will be ingested is in terms of sources, channels and sinks. Sources define where the data comes from. Sinks define the target to which the data will be persisted. Channels define how the data is buffered and transferred between sources and sinks.

The first thing to do is to define the Sources, Channels and Sinks for a Flume connection.

# Naming the components on the current agent.
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

The section above names the source, channel and sink that Flume will operate on. Note that Flume can handle multiple sources, channels and sinks within a single configuration file; however, for this tutorial we will keep it simple with one source, one channel and one sink.

Here, the string “TwitterAgent” refers to the name of the Flume agent that the properties of “sources”, “channels” and “sinks” belong to.

In essence, by specifying “TwitterAgent” as the string before “sources”, we are configuring the sources property of the Flume agent named “TwitterAgent”. This is important as we will refer to this agent name when running Flume.

Here we have named the source "Twitter", the channel "MemChannel" and the sink "HDFS". These names will reappear throughout the rest of the configuration, where each component's properties are defined. You could have used any other strings – e.g. "Twtr", "M1", "S1" – to refer to the sources, channels and sinks, but I chose this naming convention for readability.

Generally, naming the components in Flume takes on the following convention:

<agent_name>.sources = <source_name>
<agent_name>.channels = <channel_name>
<agent_name>.sinks = <sink_name>
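A single agent can also declare several components of the same kind by listing space-separated names. As a sketch only (the agent and component names here are hypothetical, purely to illustrate the convention):

```
# Hypothetical agent with one source, two channels and two sinks
MultiAgent.sources = Twitter
MultiAgent.channels = MemChannel FileChannel
MultiAgent.sinks = HDFS Logger
```

Each name listed here would then get its own block of properties in the later sections.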

 

Describing the Source

A source will have a list of properties. The property “type” is common to every source, and it is used to specify the type of the source we are using. Examples of source types are: HTTP, Twitter, UDP etc.

# Describing/Configuring the source
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.consumerKey = <consumer key>
TwitterAgent.sources.Twitter.consumerSecret = <consumer secret>
TwitterAgent.sources.Twitter.accessToken = <access token>
TwitterAgent.sources.Twitter.accessTokenSecret = <access token secret>
TwitterAgent.sources.Twitter.keywords = @realDonaldTrump, @HillaryClinton, @SenSanders, @BernieSanders, @tedcruz, #election2016, #hillaryclinton, #hillary, #hillary2016, #Hillary2016, #donaldtrump, #trump, #dumptrump, #pooptrump, #turdtrump, #sanders, #tedcruz, #feelthebern, #dontfeelthebern, #bernie2016, #trump2016, #whybother2016, #trumptrain, #notrump, #whichhillary, #voteforbernie, #sandersonly, #americafortrump, #berniecrats, #berniestrong, #berniesanders2016, #imwithher, #killary, #stepdownhillary, #stophillary, #vote2016

Some sources such as Twitter may require specific Java libraries such as:

com.cloudera.flume.source.TwitterSource
org.apache.flume.source.twitter.TwitterSource

You will need to find out if your desired source is supported and if there are any implementation specific libraries required.

Other than "type", each source type has its own set of required properties. For example, the Twitter source used above requires the consumerKey, consumerSecret, accessToken and accessTokenSecret properties. Had you chosen the HTTP source, a different set of properties would be required.

You will need to find out, for each source type, which specific properties you must provide in the configuration. This will probably be the most challenging part of creating the Flume config file.

In the above example, the keywords to filter the tweets are defined in the property “keywords”. This means that Flume will capture tweets with any of the above listed keywords.

Specifying the source properties in Flume takes on the following convention:

<agent_name>.sources.<source_name>.type = <value>
<agent_name>.sources.<source_name>.<property2> = <value>
<agent_name>.sources.<source_name>.<property3> = <value>
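For comparison, and as a sketch only (the source name and port number are arbitrary choices, and the source would also need to be listed in the agent's sources line), an HTTP source follows the same convention but with its own properties:

```
# Hypothetical HTTP source accepting events on port 5140
TwitterAgent.sources.Http.type = http
TwitterAgent.sources.Http.bind = 0.0.0.0
TwitterAgent.sources.Http.port = 5140
```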

 

Describing the Sink

Similar to the source, each sink (in the case of multiple sinks) has its own list of properties. The property "type" is common to every sink and specifies the type of the sink being used. Beyond "type", different sinks have their own required properties, as shown below.

# Describing/Configuring the sink
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = /twitteranalytics/incremental
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.filePrefix = twitter-
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.rollSize = 524288
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.sinks.HDFS.hdfs.idleTimeout = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.threadsPoolSize = 2
TwitterAgent.sinks.HDFS.hdfs.round = true
TwitterAgent.sinks.HDFS.hdfs.roundUnit = hour

In the example above, the sink type is “hdfs” and the properties belonging to the sink type HDFS are shown above. You can refer to this website for the list of properties and their definitions: http://hadooptutorial.info/flume-data-collection-into-hdfs/

Specifying the sink properties in Flume takes on the following convention:

<agent_name>.sinks.<sink_name>.type = <value>
<agent_name>.sinks.<sink_name>.<property2> = <value>
<agent_name>.sinks.<sink_name>.<property3> = <value>

Note that for HDFS, the property format is hdfs.<property>
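The HDFS sink also supports time-based escape sequences in hdfs.path, which work together with the round/roundUnit settings shown above. A sketch (the partitioned path is an assumption for illustration; time escapes require a timestamp on each event, which hdfs.useLocalTimeStamp supplies from the agent's clock):

```
# Hypothetical time-partitioned variant of the path above
TwitterAgent.sinks.HDFS.hdfs.path = /twitteranalytics/incremental/%Y/%m/%d
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
```

With hdfs.round = true and hdfs.roundUnit = hour, the timestamp is rounded down to the hour before the path is resolved, so all events from the same hour land in the same directory.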

 

Describing the Channel

The channel configuration is similar to that of sources and sinks and is as shown below:

# Describing/Configuring the channel
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100

You can refer to this website for the list of properties and their definitions: http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html#flume-channels
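A memory channel is fast but loses any buffered events if the agent dies. If durability matters more than speed, a file channel can be swapped in. A sketch (the channel name and directory paths are placeholders):

```
# Hypothetical durable alternative to MemChannel
TwitterAgent.channels.FileChannel.type = file
TwitterAgent.channels.FileChannel.checkpointDir = /var/flume/checkpoint
TwitterAgent.channels.FileChannel.dataDirs = /var/flume/data
```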

 

Bindings

Finally, to connect the source, channel and sink together, the following needs to be declared in the configuration file. This configuration specifies that the Twitter source and the HDFS sink both use the same channel, "MemChannel", which effectively binds the source to the sink.

# Binding the source and sink to the channel
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
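Note that sources take a plural "channels" property while sinks take a singular "channel": a source can fan out to several channels, but each sink drains exactly one. As a sketch (the second channel and sink here are hypothetical), replicating the same tweets to two destinations would look like:

```
# Hypothetical fan-out: one source replicated into two channels
TwitterAgent.sources.Twitter.selector.type = replicating
TwitterAgent.sources.Twitter.channels = MemChannel FileChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.Logger.channel = FileChannel
```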

 

Starting Flume Agent

When starting Flume, you specify the agent to run and its configuration file as follows:

flume-ng agent -f flume_process_twitter.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

As you will notice, the --name / -n <name> command-line parameter specifies which agent to execute. Since the agent name in the configuration file is TwitterAgent, all configuration settings beginning with TwitterAgent will be applied.

Note that this allows us to specify multiple different agents within the same single configuration file.
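For example (the second agent here is hypothetical), a single file could hold two independent agents, with properties grouped by the agent-name prefix:

```
# Two agents defined in the same configuration file
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

LogAgent.sources = Syslog
LogAgent.channels = Mem2
LogAgent.sinks = HDFS2
```

Starting flume-ng with -n LogAgent would then apply only the LogAgent.* settings, ignoring the TwitterAgent.* ones.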

That’s it!

We have covered a brief overview and walkthrough on configuring Flume. I hope this will help you understand the structure of the Flume configuration file and how to set it up for your next big data project!

Here are links to some of the websites and how-tos that I found helpful in understanding how to set up Flume.

 

In the next post, I will go through in detail how to use Spark to execute the sentiment analysis.

 
