Big Data | Learn for Master
  • Adding Multiple Columns to Spark DataFrames

    Adding Multiple Columns to Spark DataFrames

    from: https://p058.github.io/spark/2017/01/08/spark-dataframes.html

    I have been using Spark's DataFrame API for quite some time, and I often want to add many columns to a DataFrame (for example, creating more features from existing features for a machine learning model), which makes it tedious to write many withColumn statements. So I monkey-patched the Spark DataFrame to make it easy to add multiple columns to a DataFrame.

    First, let's create a udf_wrapper decorator to keep the code concise.

    Let's create a Spark DataFrame with the columns user_id and app_usage (each app and the number of sessions for that app).
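
    The post's monkey-patching code is not shown in this excerpt, so here is a minimal sketch of the same idea under assumed data (the user_id/app_usage values, the with_columns helper, and the feature expressions below are illustrative, not the author's code):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Illustrative data: user_id plus a map of app name -> number of sessions.
    df = spark.createDataFrame(
        [("u1", {"mail": 3, "chat": 7}), ("u2", {"mail": 1})],
        ["user_id", "app_usage"],
    )

    def with_columns(df, cols):
        """Add every (name, Column expression) pair in cols in one pass."""
        for name, col in cols.items():
            df = df.withColumn(name, col)
        return df

    # Derive several feature columns without repeating withColumn at the call site;
    # apps missing from the map default to 0 sessions.
    features = {
        "mail_sessions": F.coalesce(F.col("app_usage")["mail"], F.lit(0)),
        "chat_sessions": F.coalesce(F.col("app_usage")["chat"], F.lit(0)),
    }
    df = with_columns(df, features)
    df.show()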

    [Read More...]
  • Use Spark to calculate a moving average for time series data

    Spark Window Functions for DataFrames and SQL

    from: http://xinhstechblog.blogspot.de/2016/04/spark-window-functions-for-dataframes.html

    Introduced in Spark 1.4, Spark window functions improved the expressiveness of Spark DataFrames and Spark SQL. With window functions, you can easily calculate a moving average or cumulative sum, or reference a value in a previous row of a table. Window functions allow you to do many common calculations with DataFrames, without having to resort to RDD manipulation.

    Aggregates, UDFs vs. Window functions

    Window functions are complementary to existing DataFrame operations: aggregates, such as sum and avg, and UDFs. To review,
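
    As a minimal sketch of the moving-average case (the time series values and column names below are illustrative, and Window.unboundedPreceding assumes a reasonably recent PySpark), a window ordered by date and bounded to the preceding rows does the job:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.getOrCreate()

    # Illustrative time series: one reading per day.
    ts = spark.createDataFrame(
        [("2016-04-01", 10.0), ("2016-04-02", 12.0),
         ("2016-04-03", 11.0), ("2016-04-04", 15.0)],
        ["day", "value"],
    )

    # Moving average over the current row and the two preceding rows.
    w = Window.orderBy("day").rowsBetween(-2, 0)
    ts = ts.withColumn("moving_avg", F.avg("value").over(w))

    # Cumulative sum from the start of the series up to the current row.
    w_cum = Window.orderBy("day").rowsBetween(Window.unboundedPreceding, 0)
    ts = ts.withColumn("cumulative_sum", F.sum("value").over(w_cum))

    ts.show()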

    [Read More...]
  • Start Pig using a shell script

    Here is an example of starting a Pig script from a shell script:

    progname=$(basename "$0")

    # Various config params

    UDF_JARS_LOCATION=~/udf_jars

    DEFAULT_JOIN_NUM_REDUCERS=2048

    QUEUE=curveball_med

    # Prepare Pig command

    PIG_OPTS="-Dmapred.job.queue.name=${QUEUE} -Dmapred.child.java.opts=-Xmx2840m -Dmapred.job.map.memory.mb=4548 -Dmapred.job.reduce.memory.mb=4548 -Dmapreduce.jobtracker.split.metainfo.maxsize=20000000 -Dmapreduce.job.acl-view-job=* -Dmapreduce.job.acl-modify-job=* -Dmapreduce.job.classloader=false -Dio.sort.mb=1000 -Dio.sort.factor=100 -Dpig.stats.noTaskReport=true -Dexectype=mapreduce -Dmapred.cache.archives=hdfs…/xxx.tar.gz#achievePrefix -Dmapred.child.env=LD_LIBRARY_PATH=./achievePrefix/xxx/lib64"

    # Pig params

    PIG_PARAMS=" -param SOME_PATH=${SOME_PATH} -param DEFAULT_JOIN_NUM_REDUCERS=${DEFAULT_JOIN_NUM_REDUCERS} -param QUEUE=${QUEUE}"

    # Pig Jars location

    PIG_JARS_FALCON=" -Dpig.additional.jars=local…./apps/lib/xxxx.jar:local…/lib/commons-math-2.2.jar:$UDF_JARS_LOCATION/datafu-1.2.0.jar:$UDF_JARS_LOCATION/xxxxxx.jar:$UDF_JARS_LOCATION/zzzzz.jar"

    pig -useHCatalog -conf /…./hbase/hbase-site.xml -useversion 0.11 ${PIG_OPTS} ${PIG_PARAMS} ${PIG_JARS_FALCON} somePigScript.pig

    [Read More...]
  • Pig: Container is running beyond physical memory

    Pig: Container is running beyond physical memory limits using Oozie

    Here is a good answer from: http://stackoverflow.com/questions/21005643/container-is-running-beyond-memory-limits

    Looks like the default for YARN container size on that cluster (1 GB of RAM) is way too low for your job.

    But it’s not clear whether the YARN error shown relates to the Shell action (running grunt) or to a child MapReduce execution.

    Plan A – Assuming that it's the child MapReduce execution that requires more RAM, just add the following at the top of the Grunt script …

    Reference: Pig documentation,

    [Read More...]
  • Install Spark kernel and PySpark kernel using Toree

    Jupyter Notebook (formerly known as IPython Notebook) is an interactive notebook environment that supports various programming languages and allows you to interact with your data, combine code with Markdown text, and perform simple visualizations.

    Here are just a couple of reasons why using Jupyter Notebook with Spark is the best choice for users who wish to present their work to other team members or to the public in general:

    • Jupyter notebooks support tab autocompletion on class names, functions, methods and variables.
    • It offers more explicit and colour-highlighted error messages than the command line IPython console.
    [Read More...]
  • SQL, Hive, HBase and Impala

    What is Impala?

    Impala is an open-source, massively parallel processing query engine that runs on top of clustered systems like Apache Hadoop. It was created based on Google's Dremel paper. It is an interactive, SQL-like query engine that runs on top of the Hadoop Distributed File System (HDFS), which Impala uses as its underlying storage.

    It integrates with the Hive metastore to share table information between the two components. Impala makes use of the existing Apache Hive installation (initiated by Facebook and open-sourced to Apache) that many Hadoop users already have in place to perform batch-oriented, long-running jobs in the form of SQL queries.

    [Read More...]
  • Best articles to learn Hive

    Hive Architectural Overview

    SQL queries are submitted to Hive and they are executed as follows:

    1. Hive compiles the query.

    2. An execution engine, such as Tez or MapReduce, executes the compiled query.

    3. The resource manager, YARN, allocates resources for applications across the cluster.

    4. The data that the query acts upon resides in HDFS (Hadoop Distributed File System). Supported data formats are ORC, AVRO, Parquet, and text.

    5. Query results are then returned over a JDBC/ODBC connection.

    A simplified view of this process is shown in the following figure.
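
    As a minimal client-side sketch of step 5 (results coming back to the caller), the snippet below submits a query to a HiveServer2 instance with the PyHive library; the host, port, username, and table are hypothetical, and PyHive talks to HiveServer2 over Thrift rather than JDBC/ODBC:

    from pyhive import hive

    # Hypothetical HiveServer2 endpoint; adjust host/port/username for your cluster.
    conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    # Hive compiles the query, an engine such as Tez or MapReduce executes it on YARN,
    # and the rows are returned over the same connection.
    cursor.execute("SELECT city, COUNT(*) FROM mytable GROUP BY city")
    for row in cursor.fetchall():
        print(row)

    cursor.close()
    conn.close()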

    [Read More...]
  • Best resources to learn Hive partitioning

    Terminology

    • Static Partition (SP) columns: in DML/DDL involving multiple partitioning columns, the columns whose values are known at COMPILE TIME (given by user).
    • Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.

    Syntax

    DP columns are specified the same way as SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.
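
    As a small illustration of that rule (the sales and staging_sales tables and their columns are hypothetical, and the statement is issued here through PySpark's SQL interface), country below is an SP column given a value, while state is a DP column listed without one:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Allow dynamic partitioning; nonstrict mode also permits inserts
    # where every partition column is dynamic.
    spark.sql("SET hive.exec.dynamic.partition = true")
    spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    # country='US' is static (known at compile time); state takes its values
    # from the last column of the SELECT at execution time.
    spark.sql("""
        INSERT OVERWRITE TABLE sales PARTITION (country='US', state)
        SELECT order_id, amount, state FROM staging_sales
    """)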

    In INSERT …

    [Read More...]
  • How to set the file numbers of a Hive table using the insert command

    Here are some articles that show how to control the file numbers of a Hive table when inserting data:

    How to control the file numbers of a Hive table after inserting data on MapR-FS
     http://www.openkb.info/2014/12/how-to-control-file-numbers-of-hive.html
    A Hive table contains files in HDFS; if one table or one partition has too many small files, HiveQL performance may suffer.
    Sometimes it can take a long time to prepare a MapReduce job before submitting it, since Hive needs to get the metadata from each file.
    This article explains how to control the file numbers of a Hive table after inserting data on MapR-FS.
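
    The linked article covers Hive's own settings; as a related sketch on the Spark side (the staging_events and events table names are hypothetical), you can bound the number of files an insert produces by repartitioning before the write:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    df = spark.table("staging_events")

    # Repartitioning to a small, fixed number of partitions bounds the number of
    # output files (roughly one file per partition), instead of one small file per task.
    df.repartition(8).write.mode("overwrite").insertInto("events")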

    [Read More...]
  • Hive partitioning vs Bucketing

    Hive Bucketing and Partitioning

    To better understand how partitioning and bucketing work, take a look at how data is stored in Hive. Let's say you have a table:

    CREATE TABLE mytable (
      name string,
      city string,
      employee_id int )
    PARTITIONED BY (year STRING, month STRING, day STRING)
    CLUSTERED BY (employee_id) INTO 256 BUCKETS

    You insert some data into a partition for 2015-12-02. Hive will then store data in a directory hierarchy, such as:

    /user/hive/warehouse/mytable/year=2015/month=12/day=02
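
    For illustration, a sketch of an insert that would create that directory and hash the rows into the 256 buckets declared above; the HiveServer2 endpoint and the staging_employees table are hypothetical, and the statement is run through the PyHive library so that Hive itself performs the bucketed write:

    from pyhive import hive

    # Hypothetical HiveServer2 endpoint; adjust for your cluster.
    conn = hive.Connection(host="hive-server.example.com", port=10000, username="analyst")
    cursor = conn.cursor()

    # On Hive 1.x this makes the insert honour the bucket count;
    # Hive 2.x always enforces bucketing, so drop this SET there.
    cursor.execute("SET hive.enforce.bucketing = true")

    # Writing into the 2015-12-02 partition creates the directory shown above,
    # with rows hashed on employee_id into 256 bucket files inside it.
    cursor.execute("""
        INSERT INTO TABLE mytable PARTITION (year='2015', month='12', day='02')
        SELECT name, city, employee_id FROM staging_employees
    """)

    cursor.close()
    conn.close()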

    As such, it is important to be careful when partitioning.

    [Read More...]
Page 1 of 5