Creating a classification model to filter spam

In this scenario, you create Spark Batch Jobs. The key components to be used are as follows:

- tModelEncoder: several tModelEncoder components are used to transform given SMS text messages into feature sets.
- tRandomForestModel: it analyzes the features coming from tModelEncoder to build a classification model that learns what junk messages and normal messages look like.
- tPredict: in a new Job, it applies this classification model to a new set of SMS text messages to separate the spam from the normal messages. In this scenario, the result of this classification is used to evaluate the accuracy of the model, since the class of each message processed by tPredict is already known and explicitly marked.
- tHDFSConfiguration: this component is used by Spark to connect to the HDFS system to which the jar files the Job depends on are transferred.
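
The evaluation step described for tPredict amounts to comparing each predicted class with the class already marked on the message. The following is a minimal sketch of that comparison in plain Python, not the logic Talend generates. The function name `evaluate` and the label values `spam` and `ham` are assumptions made for illustration.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Compare known labels with predicted labels.

    Returns the overall accuracy and a confusion count keyed by
    (actual_label, predicted_label). Hypothetical helper for illustration.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one labelled message is required")

    # Count every (actual, predicted) pair, for example ("ham", "spam").
    confusion = Counter(zip(actual, predicted))
    # A prediction is correct when both labels of the pair match.
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return {"accuracy": correct / len(actual), "confusion": dict(confusion)}


# Assumed label values; the real test set may mark the classes differently.
actual = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "spam", "ham"]
report = evaluate(actual, predicted)
print(report["accuracy"])
print(report["confusion"])
```

The confusion counts show which kind of error dominates, for example normal messages wrongly flagged as spam, which the accuracy figure alone does not reveal.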

This scenario applies only to subscription-based Talend Platform products with Big Data and Talend Data Fabric.

Before you begin

In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
- Yarn mode (Yarn client or Yarn cluster):
  - When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.
  - When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.
  - When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
  - When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS, in which case you use tHDFSConfiguration.
- Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration or tS3Configuration.
  
  If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).
Download the sets of SMS text messages:
- The set used to train the classification models: trainingSet.zip.
- The set used to evaluate the created models: testSet.zip.
Talend created these two sets from the dataset downloadable from https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection. The dataset preparation Job (dataset_preparation.zip) adds 3 feature columns (number of currency symbols, number of numeric values and number of exclamation marks) to the raw dataset and then splits the dataset proportionally.
An example of the junk messages reads as follows:
```
Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
```
An example of the normal messages reads as follows:
```
Ahhh. Work. I vaguely remember that! What does it feel like? Lol
```
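
To make the three added feature columns concrete, the sketch below counts currency symbols, numeric values and exclamation marks in the two sample messages. It is an illustration only and not the logic of the dataset preparation Job. The set of currency symbols, the rule that one run of digits counts as one numeric value, and the name `extract_features` are all assumptions.

```python
import re

# Assumption: illustrative set of currency symbols; the preparation Job
# may count a different set.
CURRENCY_SYMBOLS = "$£€¥"


def extract_features(message: str) -> dict:
    """Return the three counts for one SMS message (hypothetical helper)."""
    return {
        "num_currency": sum(message.count(sym) for sym in CURRENCY_SYMBOLS),
        # Assumption: each maximal run of digits is one numeric value.
        "num_numeric": len(re.findall(r"\d+", message)),
        "num_exclamation": message.count("!"),
    }


spam = (
    "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. "
    "Text FA to 87121 to receive entry question(std txt rate)"
    "T&C's apply 08452810075over18's"
)
ham = "Ahhh. Work. I vaguely remember that! What does it feel like? Lol"

print(extract_features(spam))
print(extract_features(ham))
```

Under these assumptions, the junk message yields six numeric values and no exclamation marks, while the normal message yields no numeric values and one exclamation mark.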
Note: The new features added to the raw dataset were chosen after observing the junk messages used specifically in this scenario (these junk messages often contain prices and/or exclamation marks), so they cannot be generalized to whatever junk messages you want to analyze. In addition, the dataset was randomly split into two sets and used as is. In real-world practice, you can continue to preprocess the sets with other methods, such as dataset balancing, in order to better train your classification model.
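
One way to read "proportionally split" is a random split that keeps the share of junk and normal messages the same in both sets. The sketch below shows that idea in plain Python. It is not the preparation Job itself: the 70/30 ratio, the fixed seed, the label values and the name `stratified_split` are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_fraction=0.7, seed=42):
    """Split rows into (train, test), preserving each label's share.

    Hypothetical helper: train_fraction and seed are illustrative values.
    """
    rng = random.Random(seed)

    # Group the rows by class so that each class is split separately.
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)

    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


# Toy data: 10 junk messages and 40 normal ones, as (label, text) pairs.
rows = [("spam", f"s{i}") for i in range(10)] + [("ham", f"h{i}") for i in range(40)]
train, test = stratified_split(rows, label_of=lambda row: row[0])
print(len(train), len(test))
```

Splitting each class separately keeps the minority class present in both sets, which a plain random split over a small dataset does not guarantee.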
The two sets must be stored on the machine where the Job is going to be executed, for example in the HDFS system of your Yarn cluster if you use the Spark Yarn client mode to run Talend Spark Jobs. You must also have the appropriate rights and permissions to read data from and write data to this system.
In this scenario, the Spark Yarn client will be used and the datasets are stored in the associated HDFS system.
The Spark cluster to be used must be properly set up and running.
