Learn By Example | Learn for Master
  • Hive Import Export


    The EXPORT command exports the data of a table or partition, along with the metadata, into a specified output location. This output location can then be moved over to a different Hadoop or Hive instance and imported from there with the IMPORT command.

    When exporting a partitioned table, the original data may be located in different HDFS locations. Exporting or importing only a subset of the partitions is also supported.

    Exported metadata is stored in the target directory, and data files are stored in subdirectories.

    The EXPORT and IMPORT commands work independently of the source and target metastore DBMS used.

    [Read More...]
  • Document Summarization using TextRank Example

    TextRank is a text summarization algorithm based on PageRank. In TextRank, the vertices of the graph are sentences, and the edge weight between two sentences denotes their similarity.

    Using the following steps, we can extract important sentences from a set of documents.

    1. Sentence identification: split the documents into sentences
    2. Tokenization: split each sentence into a set of words
    3. Similarity calculation: calculate the similarity between sentences
    4. Build sentence graph: build a graph of the sentences
    5. TextRank: score the sentences via PageRank
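
    The five steps above can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the post's implementation: the regex splitter stands in for a real sentence tokenizer, the similarity is a simple length-normalized word overlap, and the function names (split_sentences, tokenize, similarity, textrank) are hypothetical.

```python
import math
import re


def split_sentences(text):
    # Step 1: naive splitter, standing in for a real sentence tokenizer.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Step 2: lowercased word tokens; a set is enough for overlap counting.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def similarity(a, b):
    # Step 3: shared words, normalized by the (log) sentence lengths.
    if len(a) < 2 or len(b) < 2:
        return 0.0
    return len(a & b) / (math.log(len(a)) + math.log(len(b)))


def textrank(text, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Step 4: weighted, undirected sentence graph as an adjacency matrix.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    # Step 5: weighted PageRank by simple fixed-count iteration.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0)
            for i in range(n)
        ]
    # Highest-scoring sentences first.
    return sorted(zip(scores, sentences), reverse=True)


text = ("Spark is fast. Spark is a fast engine for data. "
        "Cats sleep all day. Spark runs data jobs fast.")
ranked = textrank(text)
for score, sentence in ranked:
    print(round(score, 3), sentence)
```

    A sentence that shares no words with any other sentence receives no score from its neighbours and ends up at the bottom of the ranking, which is the behaviour a summarizer wants.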

    Sentence identification

    We can use nltk’s included Punkt module to get sentences from a document.

    [Read More...]
  • Learn spark by examples (2)

    In the previous post, we introduced Spark, RDDs, and how to use RDDs for basic data analysis. In this post, I will show more examples of how to use the RDD methods.

    Spark RDD reduceByKey Method

    We have used reduceByKey to solve the word frequency calculation problem. Here I will use a more complicated example to show how to use reduceByKey.

    Suppose we have a set of tweets, each shared by different users. We also give each user a weight denoting their importance.
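
    To make the semantics concrete without a Spark cluster, here is a plain-Python sketch that mimics what reduceByKey does: it merges all values that share a key using a two-argument function. The helper reduce_by_key, the sample shares, and the user weights are all hypothetical, and the sketch assumes the goal is to score each tweet by the summed weight of the users who shared it.

```python
def reduce_by_key(pairs, func):
    # Mimics the semantics of RDD.reduceByKey on a local list of (key, value) pairs:
    # values with the same key are merged pairwise with func.
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Hypothetical sample data: (tweet_id, user) share events and per-user weights.
shares = [("t1", "alice"), ("t2", "bob"), ("t1", "carol"),
          ("t2", "alice"), ("t1", "bob")]
user_weight = {"alice": 0.5, "bob": 2.0, "carol": 1.0}

# Map each share to (tweet_id, weight), then sum the weights per tweet.
weighted = [(tweet, user_weight[user]) for tweet, user in shares]
tweet_scores = reduce_by_key(weighted, lambda a, b: a + b)
print(tweet_scores)
```

    On a real RDD of (tweet_id, weight) pairs, the last step would correspond to calling reduceByKey(lambda a, b: a + b).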

    Here is an example of the data set.

    [Read More...]
  • Pyspark broadcast variable Example

    Pyspark broadcast variable

    Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. The concept of broadcast variables is similar to Hadoop’s distributed cache.

    When to use broadcast variable

    The best case for a broadcast variable is when you want to join two tables and one of them is small. By using a broadcast variable, we can implement a map-side join, which is much faster than a reduce-side join because it avoids the expensive shuffle.

    Suppose we have the following RDD, and we want to join it with another RDD.
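
    The idea of a map-side join can be illustrated in plain Python, without Spark. In this sketch the small table is an ordinary dict that every partition can read locally; in PySpark it would be wrapped with sc.broadcast(...) and read inside tasks through its value attribute. The table contents, the partition layout, and the helper join_partition are hypothetical.

```python
# Small lookup table (country_id -> country name). In PySpark this is the
# object that would be broadcast to every machine.
small_table = {1: "US", 2: "UK"}

# Large data set, shown here as two partitions of (user, country_id) records.
partitions = [
    [("u1", 1), ("u2", 2)],
    [("u3", 1), ("u4", 3)],
]


def join_partition(records, lookup):
    # Map-side (inner) join: each record is matched against the local copy
    # of the lookup table, so no data has to move between partitions.
    return [(user, lookup[cid]) for user, cid in records if cid in lookup]


joined = [row for part in partitions
          for row in join_partition(part, small_table)]
print(joined)
```

    Each partition is processed independently, which is why no shuffle is needed; records whose key is missing from the small table are dropped, as in an inner join.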

    [Read More...]
  • Popular python problems and solutions

    Python is a popular programming language that can be used for almost any project. When you learn Python, you may run into questions about tasks such as file processing, list and dict usage, databases, time, URLs, etc. In this tutorial, we give clean solutions to some of the most frequent problems you may encounter when you learn Python.

    1. File related questions
      How to check whether a file exists using Python?
      How to check whether a path is a file?
      How to make sure a directory exists?
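
    The three file questions above can be answered with the standard os module. This is a minimal sketch, not necessarily the solution given in the full post; the file and directory names are hypothetical and live in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "example.txt")
    with open(file_path, "w") as f:
        f.write("hello")

    # Does the path exist at all (file or directory)?
    exists = os.path.exists(file_path)

    # Is the path a regular file? A directory returns False here.
    is_file = os.path.isfile(file_path)
    dir_is_file = os.path.isfile(tmp)

    # Make sure a directory exists: create it (and its parents) if missing,
    # and do nothing if it is already there.
    nested = os.path.join(tmp, "a", "b")
    os.makedirs(nested, exist_ok=True)
    os.makedirs(nested, exist_ok=True)  # second call is a no-op
    nested_exists = os.path.isdir(nested)

print(exists, is_file, dir_is_file, nested_exists)
```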
    [Read More...]
  • An Example on how to setup or change bash prompt (PS1)

    Sometimes you want to change the bash prompt to a more friendly format.
    We can change it using:

    PS1="my new prompt"

    See this example:

    [Read More...]
  • How to keep original file when using gzip/gunzip

    When you use the command gzip file_x, it compresses the data into file_x.gz and removes the original file.

    In order to keep the original file, we use the following command:

    gzip -c input.txt > output.txt.gz

    Similarly, we can keep the original gzip file when we decompress it using gunzip:
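
    The same idea can be sketched in Python with the standard gzip and shutil modules: read the input, write a compressed copy to a new file, and leave the original untouched. The helper gzip_keep and the file names are hypothetical.

```python
import gzip
import os
import shutil
import tempfile


def gzip_keep(src, dst):
    # Write a compressed copy of src to dst; src itself is not modified or removed.
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "input.txt")
    dst = os.path.join(tmp, "output.txt.gz")
    with open(src, "w") as f:
        f.write("hello gzip\n")

    gzip_keep(src, dst)

    # The original file is still there, and the archive decompresses to the same text.
    original_kept = os.path.exists(src)
    with gzip.open(dst, "rt") as f:
        round_trip = f.read()

print(original_kept, repr(round_trip))
```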

    [Read More...]