PySpark: check if a file exists


When using Spark, we often need to check whether an HDFS path exists before loading the data, because if the path is not valid, we get the following exception:

org.apache.hadoop.mapred.InvalidInputException: Input Pattern hdfs://…xxx matches 0 files

In this post, I describe two methods to check whether an HDFS path exists in PySpark.

The first solution is to try to load the data inside a try block: we attempt to read the first element from the RDD. If the file does not exist, a Py4JJavaError is raised; we catch the error and return False.
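A minimal sketch of this approach, assuming an existing SparkContext named sc (the helper name path_exists is ours):

from py4j.protocol import Py4JJavaError

def path_exists(sc, path):
    try:
        rdd = sc.textFile(path)
        # textFile is lazy; forcing the first element triggers the
        # InvalidInputException if the path matches no files.
        rdd.first()
        return True
    except Py4JJavaError:
        # Raised for Java-side errors, including "matches 0 files".
        return False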

The above solution does not always work, because the same exception can also be triggered by other problems, such as a network error.

Instead, we can call the hdfs command directly to check whether the file exists. See this post on how to execute a Hadoop HDFS command in Python (first link in the references below).

The hdfs command to test whether a file exists is as follows:

hdfs dfs -test -e hdfs_file

To run a Hadoop command in Python, we can use the subprocess module.
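A minimal sketch, assuming the hdfs binary is on the PATH (the helper name hdfs_path_exists is ours):

import subprocess

def hdfs_path_exists(path):
    # 'hdfs dfs -test -e' exits with status 0 if the path exists
    # and non-zero otherwise, so the return code is the answer.
    proc = subprocess.Popen(
        ['hdfs', 'dfs', '-test', '-e', path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    proc.communicate()
    return proc.returncode == 0

Unlike the first method, this avoids starting a Spark job just to probe the path, and it does not conflate a missing file with other read errors.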

References:

http://www.learn4master.com/big-data/hadoop/most-popular-hadoop-commands

http://stackoverflow.com/questions/31674333/how-to-find-if-a-folder-exists-in-hadoop-or-not