The EXPORT command exports the data of a table or partition, along with the metadata, into a specified output location. This output location can then be moved over to a different Hadoop or Hive instance and imported from there with the IMPORT command.
When exporting a partitioned table, the original data may be located in different HDFS locations. Exporting or importing a subset of the partitions is also supported.
Exported metadata is stored in the target directory, and data files are stored in subdirectories.
The EXPORT and IMPORT commands work independently of the source and target metastore DBMS used.
TextRank is an algorithm based on PageRank for text summarization. In TextRank, the vertices of the graph are sentences, and the edge weight between two sentences denotes their similarity.
Using the following steps, we can extract important sentences from a set of documents.
- Sentence identification: split the documents into sentences
- Tokenization: Split each sentence into a set of words
- Similarity calculation: Calculate the similarity between sentences
- Build sentence graph: build a graph of the sentences
- TextRank: score the sentences via PageRank
We can use nltk’s included Punkt module to get sentences from a document.
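The steps above can be sketched in plain Python. This is only a minimal illustration under stated assumptions: a regular expression stands in for Punkt, similarity is word overlap normalized by log sentence lengths, and the function names (split_sentences, tokenize, similarity, build_graph, textrank, summarize) are hypothetical helpers, not part of any library.

```python
import math
import re


def split_sentences(text):
    # Naive stand-in for a real sentence tokenizer such as NLTK's Punkt.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased set of alphanumeric words in the sentence.
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(words_a, words_b):
    # Shared words, normalized by the log of each sentence's word count.
    if not words_a or not words_b:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    return len(words_a & words_b) / denom if denom > 0 else 0.0


def build_graph(token_sets):
    # Symmetric weight matrix; the diagonal stays 0 (no self-loops).
    n = len(token_sets)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            weights[i][j] = weights[j][i] = similarity(token_sets[i], token_sets[j])
    return weights


def textrank(weights, damping=0.85, iterations=50):
    # Weighted PageRank by fixed-count power iteration.
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, top_n=1):
    # Return the top_n highest-scoring sentences in their original order.
    sentences = split_sentences(text)
    scores = textrank(build_graph([tokenize(s) for s in sentences]))
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Calling summarize(text, top_n=2) returns the two highest-scoring sentences. A sentence that shares no words with the others receives only the baseline score of 1 - damping.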
In the previous post, we introduced Spark, RDDs, and how to use RDDs for basic data analysis. In this post, I will show more examples of how to use the RDD methods.
Spark RDD reduceByKey Method
We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.
Suppose we have a set of tweets, each shared by different users. We also give each user a weight denoting their importance.
Here is an example of the data set.
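As a minimal sketch of the idea, assume made-up (tweet, user) share records and a made-up user weight table, and assume the goal is to score each tweet by summing the weights of the users who shared it. The helper reduce_by_key below is a hypothetical plain-Python stand-in that mimics what reduceByKey does on an RDD; the PySpark form is shown in a comment.

```python
# Hypothetical sample data: (tweet_id, user) pairs, one per share.
shares = [
    ("tweet_a", "alice"),
    ("tweet_a", "bob"),
    ("tweet_b", "bob"),
    ("tweet_b", "carol"),
    ("tweet_a", "carol"),
]

# Hypothetical importance weight per user.
user_weight = {"alice": 3.0, "bob": 1.0, "carol": 0.5}


def reduce_by_key(pairs, func):
    """Merge the values of each key with func, like RDD.reduceByKey."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Map step: replace each user with that user's weight.
weighted = [(tweet, user_weight[user]) for tweet, user in shares]

# Reduce step: add up the weights per tweet.
tweet_score = reduce_by_key(weighted, lambda a, b: a + b)

# PySpark equivalent, assuming `sc` is an existing SparkContext:
#   sc.parallelize(shares) \
#     .map(lambda p: (p[0], user_weight[p[1]])) \
#     .reduceByKey(lambda a, b: a + b) \
#     .collect()
```

The function passed to reduceByKey should be associative and commutative, because Spark combines values within each partition before merging the partial results across partitions.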
PySpark broadcast variable
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept of broadcast variables is similar to Hadoop’s distributed cache.
When to use broadcast variable
A broadcast variable is most useful when you want to join two tables and one of them is small. With a broadcast variable, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.
Suppose we have the following RDD, and we want to join it with another RDD.
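A map-side join can be sketched in plain Python as a lookup against a small in-memory table. The data and the helper map_side_join below are hypothetical and only illustrate the logic; in PySpark the small table would be wrapped with sc.broadcast and read through .value inside the map function, as the comment shows.

```python
# Hypothetical large side: (user_id, item) records.
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]

# Hypothetical small side: user_id -> name, small enough to fit in memory.
user_names = {1: "alice", 2: "bob"}


def map_side_join(records, lookup):
    """Inner-join (key, value) records against an in-memory lookup table.

    Each record is handled independently, so no grouping by key (shuffle)
    is needed; records whose key is missing from lookup are dropped.
    """
    return [
        (key, (value, lookup[key]))
        for key, value in records
        if key in lookup
    ]


joined = map_side_join(orders, user_names)

# PySpark equivalent, assuming `sc` is an existing SparkContext:
#   names_bc = sc.broadcast(user_names)
#   joined_rdd = (
#       sc.parallelize(orders)
#         .filter(lambda kv: kv[0] in names_bc.value)
#         .map(lambda kv: (kv[0], (kv[1], names_bc.value[kv[0]])))
#   )
```

This only pays off when the lookup table fits comfortably in memory on every executor; otherwise a regular join is the safer choice.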
Python is a popular programming language that can be used for almost any project. When you learn Python, you may run into questions about various tasks such as file processing, list and dict usage, databases, time, URLs, etc. In this tutorial, we give clean solutions to some of the most frequent problems you may encounter when you learn Python.
File related questions
How to check whether a file exists using Python?
How to check whether a path is a file?
How to make sure a directory exists?
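The three questions above can be answered with the standard library. The sketch below uses os.path.exists, os.path.isfile, and os.makedirs with exist_ok=True; the helper names describe_path and ensure_dir and the temporary paths are made up for the demonstration.

```python
import os
import tempfile


def describe_path(path):
    """Report whether path exists and whether it is a file or a directory."""
    return {
        "exists": os.path.exists(path),
        "is_file": os.path.isfile(path),
        "is_dir": os.path.isdir(path),
    }


def ensure_dir(path):
    """Create path (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration inside a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as base:
    nested = ensure_dir(os.path.join(base, "a", "b"))
    ensure_dir(nested)  # calling it again does not raise

    file_path = os.path.join(nested, "note.txt")
    with open(file_path, "w") as fh:
        fh.write("hello")

    file_info = describe_path(file_path)
    dir_info = describe_path(nested)
    missing_info = describe_path(os.path.join(base, "missing.txt"))
```

Note that os.path.exists is true for both files and directories, so use os.path.isfile when the distinction matters.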
Sometimes you want to change the bash prompt to a more friendly format.
We can change it using:
PS1="my new prompt"
See this example:
brawldare-lm:gzip_learn username$ echo $PS1
# \h : the hostname up to the first ‘.’
# \u : the username of the current user
# \W : the basename of the current working directory, with $HOME abbreviated with a tilde
# Only keep the basename of the current working directory + ">>>"
# How about showing the full working directory? Use "\w"
# \w : the current working directory, with $HOME abbreviated with a tilde
When you run the command gzip file_x, it compresses the file and replaces it with file_x.gz.
brawldare-lm:gzip_learn$ gzip file_x
In order to keep the original file, we use the following command:
gzip -c input.txt > output.txt.gz
brawldare-lm:gzip_learn>>gzip -c file_x > file_x.gzip
Similarly, we can keep the original compressed file when decompressing with gunzip:
brawldare-lm:gzip_learn>>gunzip -c file_x.gzip > gg
file_x file_x.gzip gg