tHDFSConfiguration properties for Apache Spark Batch
These properties are used to configure tHDFSConfiguration running in the Spark Batch Job framework.
The Spark Batch tHDFSConfiguration component belongs to the Storage family.
The component in this framework is available in all subscription-based Talend products with Big Data and Talend Data Fabric.
Basic settings
Property type |
Either Built-In or Repository. Built-In: No property data stored centrally. Repository: Select the repository file where the properties are stored. |
Distribution |
Select the cluster you are using from the drop-down list. The options in the list vary depending on the component you are using. Among these options, the following ones require specific configuration:
|
Hadoop version |
Select the version of the Hadoop distribution you are using. The available options vary depending on the component you are using. |
Use kerberos authentication |
If you are accessing a Hadoop cluster secured with Kerberos, select this check box, then enter the Kerberos principal name for the NameNode in the field that is displayed. This enables you to use your username to authenticate against the credentials stored in Kerberos. This check box is available depending on the Hadoop distribution you are connecting to. |
Use a keytab to authenticate |
Select the Use a keytab to authenticate check box to log into a Kerberos-enabled system using a given keytab file. A keytab file contains pairs of Kerberos principals and encrypted keys. Enter the principal to be used in the Principal field and the access path to the keytab file in the Keytab field. This keytab file must be stored on the machine on which your Job actually runs, for example, on a Talend JobServer. Note that the user that executes a keytab-enabled Job is not necessarily the one the principal designates, but that user must have the right to read the keytab file being used. For example, if the username you are using to execute a Job is user1 and the principal to be used is guest, ensure that user1 has the right to read the keytab file. |
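The read-permission requirement above can be checked before a Job is launched. The following is a minimal sketch, not part of Talend Studio: the function name `keytab_is_readable` and the example path are hypothetical, and the check must be run as the same operating system user that executes the Job (user1 in the example above).

```python
import os


def keytab_is_readable(keytab_path: str) -> bool:
    """Return True if the current OS user can read the given keytab file."""
    # isfile() rejects missing paths and directories; access() checks the
    # read permission for the user running this script.
    return os.path.isfile(keytab_path) and os.access(keytab_path, os.R_OK)


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    path = "/etc/security/keytabs/guest.keytab"
    status = "readable" if keytab_is_readable(path) else "NOT readable"
    print(f"{path}: {status}")
```

Run this on the machine that hosts the Job, for example the Talend JobServer, since that is where the keytab file must be stored.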
NameNode URI |
Type in the URI of the Hadoop NameNode, the master node of a Hadoop system. For example, if you have chosen a machine called masternode as the NameNode, the location is hdfs://masternode:portnumber. If you are using WebHDFS, the location should be webhdfs://masternode:portnumber. WebHDFS with SSL is not supported yet. |
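To illustrate the expected scheme://host:port shape, here is a minimal sketch that splits a NameNode URI and rejects anything other than the two schemes named above. The function `check_namenode_uri` is hypothetical and is not something Talend Studio provides; the port 8020 is a placeholder standing in for portnumber.

```python
from typing import Tuple
from urllib.parse import urlsplit

# Schemes named in the NameNode URI description; anything else is rejected.
SUPPORTED_SCHEMES = {"hdfs", "webhdfs"}


def check_namenode_uri(uri: str) -> Tuple[str, str, int]:
    """Split a NameNode URI into (scheme, host, port) or raise ValueError."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # parts.port itself raises ValueError when the port is not numeric.
    if not parts.hostname or parts.port is None:
        raise ValueError("expected scheme://host:port")
    return parts.scheme, parts.hostname, parts.port


if __name__ == "__main__":
    # 8020 is a placeholder; use the port your NameNode actually listens on.
    print(check_namenode_uri("hdfs://masternode:8020"))
    print(check_namenode_uri("webhdfs://masternode:8020"))
```

Rejecting unknown schemes in this sketch mirrors the note that WebHDFS with SSL is not supported.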
User name |
The User name field is available when you are not using Kerberos to authenticate. In the User name field, enter the login username for your distribution. If you leave it empty, the username of the machine hosting Talend Studio will be used. |
Group |
Enter the membership including the authentication user under which the HDFS instances were started. This field is available depending on the distribution you are using. |
Use datanode hostname |
Select the Use datanode hostname check box to allow the Job to access datanodes via their hostnames. This sets the dfs.client.use.datanode.hostname property to true. When connecting to an S3N filesystem, you must select this check box. |
Hadoop properties |
Talend Studio uses a default configuration for its engine to perform operations in a Hadoop distribution. If you need to use a custom configuration in a specific situation, complete this table with the property or properties to be customized. Then at runtime, the customized property or properties will override those default ones.
For further information about the properties required by Hadoop and its related systems such as HDFS and Hive, see the documentation of the Hadoop distribution you are using or see Apache's Hadoop documentation and then select the version of the documentation you want. For demonstration purposes, the links to some properties are listed below:
|
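The override behavior of the Hadoop properties table can be pictured as a dictionary merge in which custom entries win. This is only a sketch of the semantics described above, not Talend Studio's implementation: the function `effective_properties` is hypothetical, and the default values shown are placeholders rather than the defaults Studio actually ships.

```python
def effective_properties(defaults: dict, custom: dict) -> dict:
    """Return the configuration used at runtime: custom entries override defaults."""
    merged = dict(defaults)   # copy so the defaults are left untouched
    merged.update(custom)     # entries from the properties table take precedence
    return merged


if __name__ == "__main__":
    # Placeholder defaults, for illustration only.
    studio_defaults = {
        "dfs.client.use.datanode.hostname": "false",
        "dfs.replication": "3",
    }
    # One row of the Hadoop properties table: property name and value.
    custom_table = {"dfs.client.use.datanode.hostname": "true"}
    print(effective_properties(studio_defaults, custom_table))
```

Properties that do not appear in the table keep their default values; only the listed ones are replaced.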
Setup HDFS encryption configurations |
If the HDFS transparent encryption has been enabled in your cluster, select the Setup HDFS encryption configurations check box and in the HDFS encryption key provider field that is displayed, enter the location of the KMS proxy. For further information about the HDFS transparent encryption and its KMS proxy, see Transparent Encryption in HDFS. |
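As an illustration of what a KMS location can look like, Apache Hadoop's transparent encryption documentation writes key provider URIs in the form kms://http@host:port/kms. The sketch below builds such a string. It assumes the HDFS encryption key provider field accepts this form, which the text above does not confirm; the function name, the host kmshost, and the port 9600 are hypothetical.

```python
def kms_provider_uri(host: str, port: int, use_https: bool = False) -> str:
    """Build a key provider URI of the form kms://<protocol>@<host>:<port>/kms."""
    protocol = "https" if use_https else "http"
    return f"kms://{protocol}@{host}:{port}/kms"


if __name__ == "__main__":
    # Hypothetical host and port; use the values of your cluster's KMS.
    print(kms_provider_uri("kmshost", 9600))
```

Check the value against the key provider setting of your own cluster before entering it in the field.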
Usage
Usage rule |
This component is used standalone and does not need to be connected to other components. Place tHDFSConfiguration in the same Job as the file-system-related subJob to be run, so that the configuration is used by the whole Job at runtime. This component, along with the Spark Batch component Palette it belongs to, appears only when you are creating a Spark Batch Job. Note that in this documentation, unless otherwise explicitly stated, a scenario presents only Standard Jobs, that is to say traditional Talend data integration Jobs. |
Prerequisites |
The Hadoop distribution must be properly installed to guarantee the interaction with Talend Studio. The following list presents MapR-related information as an example.
For further information about how to install a Hadoop distribution, see the manuals corresponding to the Hadoop distribution you are using. |
Spark Connection |
In the Spark Configuration tab in the Run view, define the connection to a given Spark cluster for the whole Job. In addition, since the Job expects its dependent jar files for execution, you must specify the directory in the file system to which these jar files are transferred so that Spark can access these files:
This connection is effective on a per-Job basis. |
Specific Spark timeout |
When encountering network issues, Spark by default waits for up to 45 minutes before stopping its attempts to submit Jobs. Then, Spark triggers the automatic stop of your Job. Add the following properties to the Hadoop properties table of tHDFSConfiguration to reduce this duration.
|