OPENCONNECTOR: DistCp
Use Case: The distributed copy command (DistCp) recursively copies data files from an HDFS source location to a destination on the same or a different HDFS cluster (e.g., loadingDockLocation for ingest, shippingLocation for publish).
Example of a DistCp script with arguments: /usr/local/podium/datasets/put_file_hdfs.sh %prop.p1 %loadingDockLocation
Example of a DistCp location property (defined by API): /usr/local/podium/datasets/ENGINE.utf8.bom.txt
For the example above, the property (p1) is created by the user and can be passed directly to the script.
Script example for property: entity.custom.script.args:
/user/local/podium/usedistcp.sh %prop.p1 %loadingDockLocation
Script: /user/local/podium/usedistcp.sh
%prop.p1 – The first argument passed to the script. The '%prop' prefix tells the application to substitute the value stored in the property named p1.
The name p1 is arbitrary; it is whatever name the user chose when creating the property.
In the example above, it holds the input path.
%loadingDockLocation – A required argument that every OPENCONNECTOR script must take. The application generates this path value automatically and passes it to the script, and the script copies the file to this location using distcp. (Note: For an Ingest OPENCONNECTOR this argument is always '%loadingDockLocation'; for Publish it is always '%shippingLocation'.)
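The substitution described above can be pictured with a small bash sketch. This is an illustration only, not the application's actual mechanism: the function name expand_args and the sample loading dock path are hypothetical, and the p1 value is borrowed from the location property example earlier in this section.

```shell
#!/bin/bash
# Hypothetical illustration: mimics how the placeholders in
# entity.custom.script.args might be resolved before the script runs.
# expand_args and the sample loading dock path are assumptions.

expand_args() {
  local template="$1"    # e.g. '%prop.p1 %loadingDockLocation'
  local p1_value="$2"    # value the user stored in property p1
  local dock_path="$3"   # path generated by the application
  local token result

  # Replace every occurrence of the property placeholder.
  token='%prop.p1'
  result=${template//"$token"/$p1_value}

  # Replace every occurrence of the loading dock placeholder.
  token='%loadingDockLocation'
  result=${result//"$token"/$dock_path}

  printf '%s\n' "$result"
}

# Example call with a made-up loading dock path.
expand_args '%prop.p1 %loadingDockLocation' \
  '/usr/local/podium/datasets/ENGINE.utf8.bom.txt' \
  '/tmp/example/loadingdock'
```

In this sketch the script would receive the input path as its first argument and the generated destination path as its second, which matches the argument order described above.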
Bash DistCp script:
#!/bin/bash
# usedistcp.sh - recursively copies the contents of a source location to a
# destination location on the same or another HDFS cluster.
# Arguments:
#   1 - source location
#   2 - destination location
# Result: source location contents are copied into the destination location recursively.
if [ "$#" -ne 2 ]; then
  echo 'Usage: <source hdfs location> <destination hdfs location>' >&2
  exit 1
fi
# Create the working directory if it does not already exist.
hadoop fs -mkdir -p /tmp/user
# Print both locations for logging.
echo "$1"
echo "$2"
# Run the distributed copy.
hadoop distcp "$1" "$2"
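
When testing a script like this outside the application, it can help to check the arguments and preview the command before copying anything. The following is a minimal sketch under stated assumptions: the function names validate_distcp_args and run_distcp and the DRY_RUN variable are hypothetical and are not part of OPENCONNECTOR or Hadoop.

```shell
#!/bin/bash
# Sketch of an argument check and dry-run mode for a DistCp wrapper.
# validate_distcp_args, run_distcp, and DRY_RUN are illustrative names.

# Returns 0 only when exactly two non-empty locations are supplied.
validate_distcp_args() {
  if [ "$#" -ne 2 ] || [ -z "$1" ] || [ -z "$2" ]; then
    echo 'Usage: <source hdfs location> <destination hdfs location>' >&2
    return 1
  fi
  return 0
}

# Prints the command when DRY_RUN=1; otherwise runs hadoop distcp.
run_distcp() {
  validate_distcp_args "$@" || return 1
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf 'hadoop distcp %s %s\n' "$1" "$2"
  else
    hadoop distcp "$1" "$2"
  fi
}
```

For example, DRY_RUN=1 run_distcp /source/path /destination/path prints the distcp command line without contacting the cluster, which makes it possible to confirm the argument order before the connector runs the real copy.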