Modeling the accident-prone areas in a city

In this scenario, the tKMeansModel component is used to analyze a set of sample geographical data about the destination of ambulances in a city in order to model the accident-prone areas.

Before you begin

Use Spark version 1.4 or later.
The sample data is stored in your Hadoop file system and you have proper rights and permissions to at least read it.
Your Hadoop cluster is properly installed and is running.

If you are not sure about these requirements, ask the administrator of your Hadoop system.
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
- Yarn mode (Yarn client or Yarn cluster):
  - When using Google Dataproc, specify a bucket in the Google Storage staging bucket field in the Spark configuration tab.
  - When using HDInsight, specify the blob to be used for Job deployment in the Windows Azure Storage configuration area in the Spark configuration tab.
  - When using Altus, specify the S3 bucket or the Azure Data Lake Storage for Job deployment in the Spark configuration tab.
  - When using on-premises distributions, use the configuration component corresponding to the file system your cluster is using. Typically, this system is HDFS and so use tHDFSConfiguration.
- Standalone mode: use the configuration component corresponding to the file system your cluster is using, such as tHDFSConfiguration Apache Spark Batch or tS3Configuration Apache Spark Batch.
  
  If you are using Databricks without any configuration component present in your Job, your business data is written directly in DBFS (Databricks Filesystem).

About this task

This scenario applies only to subscription-based Talend products with Big Data.
A model like this can be employed to help determine the optimal locations for building hospitals.
The sample data consists of pairs of latitudes and longitudes. It was randomly and automatically generated for demonstration purposes only and does not in any way reflect the situation of these areas in the real world.
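The generator used for the original sample data is not described here. As a minimal sketch, assuming you want comparable test data of your own, the following plain Python scatters points around a few made-up hot spots and serializes them as delimited text. The coordinates, the spread, and the semicolon delimiter are all assumptions chosen for illustration.

```python
import csv
import io
import random


def generate_points(centers, per_center, spread, seed=0):
    """Return (latitude, longitude) pairs scattered around the given centers."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    points = []
    for lat, lon in centers:
        for _ in range(per_center):
            points.append((rng.gauss(lat, spread), rng.gauss(lon, spread)))
    rng.shuffle(points)
    return points


def to_delimited(points, delimiter=";"):
    """Serialize the pairs as delimited text, one latitude/longitude row per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for lat, lon in points:
        writer.writerow([f"{lat:.6f}", f"{lon:.6f}"])
    return buffer.getvalue()


# Hypothetical hot spots; these coordinates are invented for this example.
HOT_SPOTS = [(48.85, 2.35), (48.88, 2.30), (48.82, 2.40)]
sample = generate_points(HOT_SPOTS, per_center=50, spread=0.01)
sample_text = to_delimited(sample)
print(sample_text.splitlines()[0])
```

Writing sample_text to a file and uploading it to your file system would give you an input that matches the two-column schema used later in this scenario.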
The components to be used are:
- tFileInputDelimited: it loads the sample data into the data flow of the Job.
- tReplicate: it replicates the sample data and caches the replication.
- tKMeansModel: it analyzes the data to train the model and writes the model to HDFS.
- tModelEncoder: it pre-processes the data to prepare proper feature vectors to be used by tKMeansModel.
- tPredict: it applies the KMeans model on the replication of the sample data. In real-world practice, this data should be a set of reference data to test the model accuracy.
- tFileOutputDelimited: it writes the result of the prediction to HDFS.
- tHDFSConfiguration: this component is used by Spark to connect to the HDFS system where the jar files dependent on the Job are transferred.

Arranging data flow for the KMeans Job

Procedure

In the Integration perspective of Talend Studio, create an empty Job from the Job Designs node in the Repository tree view.
In the workspace, enter the name of the component to be used and select it from the list that appears.
Connect tFileInputDelimited to tReplicate using the Row > Main link.
Do the same to connect tReplicate to tModelEncoder and then tModelEncoder to tKMeansModel.
Repeat the operations to connect tReplicate to tPredict and then tPredict to tFileOutputDelimited.
Leave tHDFSConfiguration as it is.

Configuring the connection to the file system to be used by Spark

See the procedure in the Getting Started Guide.

Reading and caching the sample data

Procedure

Double-click the tFileInputDelimited component to open its Component view.
Click the [...] button next to Edit schema and in the pop-up schema dialog box, define the schema by adding two columns latitude and longitude of Double type.
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
tFileInputDelimited uses this configuration to access the sample data to be used as the training set.
In the Folder/File field, enter the directory where the training set is stored.
Double-click the tReplicate component to open its Component view.
Select the Cache replicated RDD check box and from the Storage level drop-down list, select Memory only. This way, the sample data is replicated and stored in memory for use as the test set.
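To make the data shape concrete, the following plain Python sketch parses delimited rows into the two Double columns defined in the schema and keeps a second in-memory copy for the test branch. It only illustrates the idea outside of Spark. The semicolon delimiter and the sample rows are assumptions.

```python
import csv
import io


def read_points(text, delimiter=";"):
    """Parse delimited rows into (latitude, longitude) float pairs."""
    points = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        latitude, longitude = float(row[0]), float(row[1])
        points.append((latitude, longitude))
    return points


# Two made-up rows standing in for the sample file.
raw = "48.8566;2.3522\n48.8606;2.3376\n"
training_set = read_points(raw)

# A tuple copy stands in for the cached replica read by the test branch.
test_set = tuple(training_set)
print(training_set)
```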

Preparing features for KMeans

Procedure

Double-click the tModelEncoder component to open its Component view.
Click the [...] button next to Edit schema and on the tModelEncoder side of the pop-up schema dialog box, define the schema by adding one column named map of Vector type.
Click OK to validate these changes and accept the propagation prompted by the pop-up dialog box.
In the Transformations table, add one row by clicking the [+] button and then proceed as follows:
1. In the Output column column, select the column that carries the features. In this scenario, it is map.
2. In the Transformation column, select the algorithm to be used for the transformation. In this scenario, it is Vector assembler.
3. In the Parameters column, enter the parameters you want to customize for use in the Vector assembler algorithm. In this scenario, enter inputCols=latitude,longitude.
In this transformation, tModelEncoder combines the latitude and longitude columns into one single feature vector column, map.
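The effect of this transformation can be pictured with a few lines of plain Python. This is only an illustrative sketch, not the component's implementation: parse_input_cols and assemble are hypothetical helpers, and a Python list stands in for the Vector type.

```python
def parse_input_cols(parameter):
    """Extract the column names from a string such as 'inputCols=latitude,longitude'."""
    key, _, value = parameter.partition("=")
    if key.strip() != "inputCols":
        raise ValueError(f"unexpected parameter: {parameter!r}")
    return [name.strip() for name in value.split(",") if name.strip()]


def assemble(row, input_cols, output_col="map"):
    """Return a copy of the row with the input columns combined into one vector."""
    vector = [float(row[name]) for name in input_cols]
    return {**row, output_col: vector}


columns = parse_input_cols("inputCols=latitude,longitude")
encoded = assemble({"latitude": 48.8566, "longitude": 2.3522}, columns)
print(encoded["map"])
```

Each input row keeps its original columns and gains one vector column that a clustering algorithm can consume directly.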
Double-click tKMeansModel to open its Component view.
Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
From the Vector to process list, select the column that provides the feature vectors to be analyzed. In this scenario, it is map, which combines all features.
Select the Save the model on file system check box and in the HDFS folder field that is displayed, enter the directory you want to use to store the generated model.
In the Number of cluster field, enter the number of clusters you want tKMeansModel to build. The right number is rarely known in advance: run the current Job several times with different numbers to create several clustering models, compare the evaluation results of the model created on each run, and then decide which number to use. For example, enter 6.
You need to write the evaluation code yourself.
From the Initialization function list, select Random. In general, this mode is used for simple datasets.
Leave the other parameters as they are.
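One common way to compare runs is the within-set sum of squared errors (WSSSE): the sum of squared distances from each point to its nearest cluster center. The sketch below is a plain Python illustration under stated assumptions, not the code run by tKMeansModel. It uses a basic Lloyd's algorithm with random initialization on a small made-up dataset, and the function names are hypothetical.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd's algorithm with random initialization."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k input points as starting centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for point in points:
            groups[nearest(point, centers)].append(point)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else centers[i]
            for i, group in enumerate(groups)
        ]
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors; lower values mean tighter clusters."""
    return sum(squared_distance(p, centers[nearest(p, centers)]) for p in points)


# Made-up data with three visible groups.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
]
scores = {k: wssse(points, kmeans(points, k)) for k in range(1, 6)}
for k, score in scores.items():
    print(k, round(score, 4))
```

The score generally falls as the number of clusters grows, so a typical heuristic is to look for the value after which the improvement flattens out rather than for the lowest score.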

Testing the KMeans model

Procedure

Double-click tPredict to open its Component view.
Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
From the Model type drop-down list, select Kmeans model.
Select the Model on filesystem radio button and enter the directory in which the KMeans model is stored.
In this case, the schema of the tPredict component contains a read-only column called label, in which the model provides the labels of the clusters.
Double-click tFileOutputDelimited to open its Component view.
Select the Define a storage configuration component check box and select the tHDFSConfiguration component to be used.
In the Folder field, browse to the location in HDFS in which you want to store the prediction result.
From the Action drop-down list, select Overwrite if the target folder already exists. If it does not exist, select Create.
Select the Merge result to single file check box and then the Remove source dir check box.
In the Merge file path field, browse to the location in HDFS in which you want to store the merged prediction result.
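Conceptually, applying a KMeans model means assigning each point the index of its nearest cluster center. The following plain Python sketch shows that step and the kind of rows that end up in the prediction result. The centers, the points, and the latitude;longitude;label layout are assumptions made for illustration.

```python
def predict(point, centers):
    """Return the index of the nearest cluster center, used here as the label."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, center)) for center in centers]
    return distances.index(min(distances))


def label_rows(points, centers, delimiter=";"):
    """Format each point with its predicted label as one delimited output row."""
    rows = []
    for latitude, longitude in points:
        label = predict((latitude, longitude), centers)
        rows.append(delimiter.join([str(latitude), str(longitude), str(label)]))
    return rows


# Hypothetical cluster centers and test points.
centers = [(0.0, 0.0), (10.0, 10.0)]
rows = label_rows([(0.1, 0.2), (9.8, 10.1)], centers)
print("\n".join(rows))
```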

Selecting the Spark mode

See the procedure in the Getting Started Guide.

Executing the Job

Procedure

Press Ctrl + S to save the Job.
Press F6 to run the Job.
The merged prediction result is stored in HDFS and you can assess it using your own evaluation process. Then run this Job again with different KMeans parameters and compare the results in order to obtain the optimal model.

Results

The following image shows an example of the predicted clusters. This visualization is produced via a Python script. You can download this script from here; remember to adapt the path in the script so that it can access the prediction result on your own machine.
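If you prefer to write a comparable script yourself, the following is a minimal sketch and not the downloadable script itself. It assumes the merged result contains semicolon-delimited latitude;longitude;label rows and that matplotlib is installed for the plotting step. The function names and the output file name are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_label(text, delimiter=";"):
    """Group (longitude, latitude) pairs by the cluster label in the third column."""
    clusters = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 3:
            continue  # skip blank or incomplete rows
        latitude, longitude = float(row[0]), float(row[1])
        label = int(float(row[2]))
        clusters[label].append((longitude, latitude))
    return dict(clusters)


def plot_clusters(clusters, output_path):
    """Draw one scatter series per cluster and save the figure (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for label, pairs in sorted(clusters.items()):
        xs, ys = zip(*pairs)
        ax.scatter(xs, ys, s=10, label=f"cluster {label}")
    ax.set_xlabel("longitude")
    ax.set_ylabel("latitude")
    ax.legend()
    fig.savefig(output_path)
    plt.close(fig)


# Made-up prediction rows; replace with the content of your merged result file.
prediction = "48.85;2.35;0\n48.86;2.36;0\n48.90;2.30;1\n"
clusters = group_by_label(prediction)
print({label: len(pairs) for label, pairs in clusters.items()})
# plot_clusters(clusters, "clusters.png")  # uncomment when matplotlib is available
```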
