Run Hadoop commands in Python
Hadoop is the most widely used platform for big data analysis. It is easy to run Hadoop commands from a shell or a shell script, but there is often a need to manipulate HDFS files directly from Python. The examples below show how to run Hadoop commands in Python to list, get, and put HDFS files.
We already know how to call an external shell command from Python, so we can simply invoke Hadoop commands through the run_cmd helper below.
import subprocess

def run_cmd(args_list):
    """Run a system command and return its (stdout, stderr)."""
    print('Running system command: {0}'.format(' '.join(args_list)))
    proc = subprocess.Popen(args_list,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    (output, errors) = proc.communicate()
    # A nonzero return code means the command failed.
    if proc.returncode:
        raise RuntimeError(
            'Error running command: %s. Return code: %d, Error: %s' % (
                ' '.join(args_list), proc.returncode, errors))
    return (output, errors)
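Because run_cmd raises RuntimeError on any nonzero return code, it also works as a quick existence check via the standard hadoop fs -test -e option, which exits with 0 only when the path exists. A minimal sketch; 'hdfs_file_path' is a placeholder:

try:
    # -test -e exits nonzero when the path is missing,
    # which makes run_cmd raise RuntimeError.
    run_cmd(['hadoop', 'fs', '-test', '-e', 'hdfs_file_path'])
    file_exists = True
except RuntimeError:
    file_exists = False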
Run Hadoop ls command in Python
(out, errors) = run_cmd(['hadoop', 'fs', '-ls', 'hdfs_file_path'])
lines = out.decode('utf-8').split('\n')  # communicate() returns bytes on Python 3
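The file paths can then be pulled out of lines with a small list comprehension. A minimal sketch, assuming the usual -ls output format (a "Found N items" summary line followed by one entry per file, with the path in the last column):

# Continue from the 'lines' variable above. Skip the "Found N items"
# summary and blank lines; keep the last whitespace-separated field
# of each entry, which is the HDFS path.
paths = [line.rsplit(None, 1)[-1]
         for line in lines
         if line and not line.startswith('Found')]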
Run Hadoop get command in Python
(out, errors) = run_cmd(['hadoop', 'fs', '-get', 'hdfs_file_path', 'local_path'])
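One caveat: -get fails when the local destination already exists. A minimal sketch of a common workaround, deleting any stale local copy first (both paths are placeholders):

import os

# -get refuses to overwrite an existing local file,
# so remove any stale copy before fetching from HDFS.
if os.path.exists('local_path'):
    os.remove('local_path')
(out, errors) = run_cmd(['hadoop', 'fs', '-get', 'hdfs_file_path', 'local_path'])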
Run Hadoop put command in Python
(out, errors) = run_cmd(['hadoop', 'fs', '-put', 'local_file', 'hdfs_file'])
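Similarly, -put fails when the HDFS destination already exists. On Hadoop 2.x and later, -put -f overwrites the destination; an alternative that works everywhere is removing the old file first. A minimal sketch ('local_file' and 'hdfs_file' are placeholders):

try:
    # Ignore the error raised when hdfs_file does not exist yet.
    run_cmd(['hadoop', 'fs', '-rm', 'hdfs_file'])
except RuntimeError:
    pass
(out, errors) = run_cmd(['hadoop', 'fs', '-put', 'local_file', 'hdfs_file'])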