Gathering Web traffic information using Hadoop

To drive a focused marketing campaign based on the habits or profiles of your customers or users, you need to collect data about their behavior on your website so that you can build user profiles and, for example, target them with the right advertisements.

The ApacheWebLog folder of the Big Data demo project that comes with your Talend Studio provides an example of identifying the users who visit a website most often, by sorting their IP addresses out of a huge number of records in the access log file of an Apache HTTP server, so that user behavior on the website can be analyzed further. This section describes the procedures for creating and configuring the Jobs that implement this example. For more information about the Big Data demo project, see the Getting Started Guide.

Before discovering this example and creating the Jobs, you should have:
  • Imported the demo project, and obtained the input access log file used in this example by executing the Job GenerateWebLogFile provided with the demo project.

  • Installed and started the Hortonworks Sandbox virtual appliance that the demo project is designed to work with, as described in the Getting Started Guide.

  • Added an IP-to-host-name mapping entry to the hosts file to resolve the host name sandbox (an example entry is shown after this list).
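
For example, assuming the Sandbox virtual machine is reachable at 192.168.56.101 (a placeholder address; replace it with the IP address of your own virtual machine), the entry in the hosts file would look like this:

    192.168.56.101   sandbox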

In this example, Talend Big Data components are used to leverage the Hadoop open source platform for handling big data. The scenario uses six Jobs:
  • The first Job sets up an HCatalog database, table and partition in HDFS.

  • The second Job uploads the access log file to be analyzed to the HDFS file system (a minimal upload sketch follows this list).

  • The third Job connects to the HCatalog database and displays the content of the uploaded file on the console.

  • The fourth Job parses the uploaded access log file: it removes any records with a "404" error, counts the code occurrences in successful service calls to the website, sorts the result data, and saves it in the HDFS file system.

  • The fifth Job parses the uploaded access log file: it removes any records with a "404" error, counts the IP address occurrences in successful service calls to the website, sorts the result data, and saves it in the HDFS file system (a sketch of this filter, count, and sort logic also follows this list).

  • The last Job reads the result data from HDFS and displays, on the standard system console, the IP addresses that made successful service calls and the number of times each visited the website.
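
To make the upload step of the second Job more concrete, here is a minimal sketch of copying a local file into HDFS with the Hadoop Java API rather than with the actual Talend components. The host name sandbox comes from the hosts file entry above; the port 8020, the class name UploadAccessLog, and both file paths are illustrative assumptions, not the demo Job's actual configuration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadAccessLog {

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "sandbox" is the host name mapped in the hosts file; 8020 is the usual
            // NameNode port on the Hortonworks Sandbox and is an assumption here.
            conf.set("fs.defaultFS", "hdfs://sandbox:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                // Both paths are placeholders; the demo Jobs use their own locations.
                fs.copyFromLocalFile(new Path("access_log.txt"),
                                     new Path("/user/weblog/access_log.txt"));
            }
        }
    }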
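
The filter, count, and sort logic of the fifth and last Jobs can also be sketched in plain Java. The sketch below reads a local file instead of HDFS; the file name access_log.txt, the class name WebLogIpCount, and the log line pattern are illustrative assumptions rather than the demo Jobs' actual configuration.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Comparator;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import java.util.stream.Collectors;

    public class WebLogIpCount {

        // Captures the client IP address and the HTTP response code from an
        // Apache access log line in common log format, for example:
        // 192.168.56.1 - - [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326
        private static final Pattern LOG_LINE =
                Pattern.compile("^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3})");

        public static void main(String[] args) throws IOException {
            // access_log.txt is a local placeholder; the demo Jobs read the file from HDFS.
            Map<String, Long> visitsPerIp = Files.lines(Paths.get("access_log.txt"))
                    .map(LOG_LINE::matcher)
                    .filter(Matcher::find)
                    .filter(m -> !"404".equals(m.group(2)))   // drop records with a "404" error
                    .collect(Collectors.groupingBy(m -> m.group(1), Collectors.counting()));

            // Sort by number of visits, most frequent IP addresses first, and print the result.
            visitsPerIp.entrySet().stream()
                    .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                    .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
        }
    }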
