  • Most Popular Deep Learning Projects

    Top Deep Learning Projects

    A list of popular GitHub projects related to deep learning (ranked by stars).

    Last Update: 2016.08.09

    Project Name    Stars    Description
    TensorFlow      29622    Computation using data flow graphs for scalable machine learning.
    Caffe           11799    Caffe: a fast open framework for deep learning.
    Neural Style    10148    Torch implementation of the neural style algorithm.
    Deep Dream       9042    Deep Dream.
    Keras            7502    Deep Learning library for Python. Convnets, recurrent neural networks, and more. Runs on Theano and TensorFlow.
    Roc AlphaGo      7170    An independent, student-led replication of DeepMind’s 2016 Nature publication,

    [Read More...]
  • Machine learning in 10 pictures

    Machine learning in 10 pictures

     from: http://www.denizyuret.com/2014/02/machine-learning-in-5-pictures.html
     
    I find myself coming back to the same few pictures when explaining basic machine learning concepts. Below is a list I find most illuminating.

    1. Test and training error: Why lower training error is not always a good thing: ESL Figure 2.11. Test and training error as a function of model complexity.

    2. Under and overfitting: PRML Figure 1.4. Plots of polynomials having various orders M, shown as red curves, fitted to the data set generated by the green curve.
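
    To get a feel for the picture in PRML Figure 1.4, a small experiment like the one sketched below helps: fit polynomials of increasing order M to a few noisy samples and compare training and test error. The sine-curve data generator, sample sizes, and noise level are my own assumptions for illustration, not taken from the book.

      import numpy as np

      rng = np.random.default_rng(0)

      # Noisy samples from an assumed "green curve": sin(2*pi*x)
      def make_data(n):
          x = rng.uniform(0, 1, n)
          y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
          return x, y

      x_train, y_train = make_data(10)
      x_test, y_test = make_data(100)

      # Low M underfits, high M overfits: test error rises while training error keeps falling
      for order in (1, 3, 9):
          coeffs = np.polyfit(x_train, y_train, deg=order)
          train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
          test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
          print(f"M={order}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")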

    [Read More...]
  • Best examples to learn machine learning

     

    Here are some good examples for learning machine learning and data science using Python and pandas.

    The following resources are from https://github.com/savarin/pyconuk-introtutorial

    The tutorial will start with data manipulation using pandas – loading and cleaning data. We’ll then use scikit-learn to make predictions. By the end of the session, we will have worked through the Kaggle Titanic competition from start to finish, over a number of iterations of increasing sophistication. We’ll also have a brief discussion on cross-validation and making visualisations.
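
    To give a flavour of that workflow, here is a minimal sketch of one such iteration: load the Titanic data with pandas, do a little cleaning, then fit and cross-validate a scikit-learn model. The file name, feature list, and model choice are assumptions for illustration, not the tutorial's exact code.

      import pandas as pd
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      # Load the Kaggle Titanic training data (file name assumed)
      df = pd.read_csv("train.csv")

      # Minimal cleaning: fill missing ages, encode sex as 0/1
      df["Age"] = df["Age"].fillna(df["Age"].median())
      df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

      features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
      X = df[features].fillna(0)
      y = df["Survived"]

      # 5-fold cross-validated accuracy as a quick sanity check
      model = RandomForestClassifier(n_estimators=100, random_state=0)
      print(cross_val_score(model, X, y, cv=5).mean())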

    [Read More...]
  • Visualize the Iris dataset using Python

    This notebook demonstrates Python data visualizations on the Iris dataset.

    from: https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations

    This Python 3 environment comes with many helpful analytics libraries installed; it is defined by the kaggle/python Docker image.

    We’ll use three libraries for this tutorial: pandas, matplotlib, and seaborn.


    The first cell displays the first five rows of the dataset:

         Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
      0   1            5.1           3.5            1.4           0.2  Iris-setosa
      1   2            4.9           3.0            1.4           0.2  Iris-setosa
      2   3            4.7           3.2            1.3           0.2  Iris-setosa
      3   4            4.6           3.1            1.5           0.2  Iris-setosa
      4   5            5.0           3.6            1.4           0.2  Iris-setosa
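
    For readers who want to reproduce this kind of visualization locally rather than on Kaggle, a minimal sketch along the following lines works; the Iris.csv file name is an assumption (on Kaggle the file lives under the kernel's input directory), and the plots below are only a small subset of what the notebook builds.

      import pandas as pd
      import seaborn as sns
      import matplotlib.pyplot as plt

      # Load the Iris data (file name assumed)
      iris = pd.read_csv("Iris.csv")
      print(iris.head())

      # Scatter plot of the sepal measurements, coloured by species
      sns.scatterplot(data=iris, x="SepalLengthCm", y="SepalWidthCm", hue="Species")
      plt.show()

      # Pairwise relationships between all four measurements
      sns.pairplot(iris.drop("Id", axis=1), hue="Species")
      plt.show()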

    (The remaining cells, In [2] through In [15], build the notebook's plots; their output is not reproduced here.)

    Wrapping Up

    I hope you enjoyed this quick introduction to some of the quick,

    [Read More...]
  • How to Reach the Top 10% in Your First Kaggle Competition

    How to Reach the Top 10% in Your First Kaggle Competition

    from: https://dnc1994.com/2016/04/rank-10-percent-in-first-kaggle-competition/

    Introduction

    Kaggle is currently the largest gathering place for data scientists. Many companies put up their own data, together with prize money, to run data competitions on Kaggle. I recently finished my first competition, placing 98th out of 2,125 teams (roughly the top 5%). Since it was my first time competing, I am quite happy with that result. Besides your rank, a finished Kaggle competition also shows which of three tiers you reached: Prize Winner, top 10%, or top 25%, so people who are new to Kaggle often aim for the 25% or 10% mark. In this article, I draw on my own first-competition experience, plus what I have learned from other Kagglers, to give newcomers who have just heard of Kaggle and want to compete some practical guidance for reaching the top 10%.

    The English version of this article is available here.

    Kaggle Profile

    The vast majority of Kagglers use Python or R. Since I mainly use Python, the examples in this article are all in Python, but R users should have no trouble grasping the ideas behind the tools.

    First, a quick overview of how Kaggle competitions work:

    • Different competitions involve different tasks: classification, regression, recommendation, ranking, and so on. Once a competition starts, the training and test sets become available for download.
    • Competitions usually run for 2 to 3 months, and each team can only submit a limited number of times per day, typically 5.
    • One week before a competition ends there is a deadline after which you can no longer merge teams or join the competition, so if you want to take part, make sure you have at least one valid submission before that deadline.
    • You normally get score feedback right after submitting. Different competitions use different evaluation metrics; the metric in use is shown at the top of the leaderboard page.
    • The feedback score is computed on only part of the test set; the remaining part is used to compute the final result, so the final ranking can shift.
    • LB refers to the Leaderboard score; as noted above, there is a Public LB and a Private LB.
    • The score from your own cross-validation is usually called CV or Local CV. In general, CV results are more reliable than the LB (a minimal local-CV sketch follows this list).
    • Newcomers can find a lot of useful experience and insight in a competition's Forum and Scripts sections. Don't hesitate to ask questions; Kagglers are very friendly.
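
    As referenced above, here is a minimal sketch of a local CV loop with scikit-learn; the model and the synthetic data are placeholders for illustration, and the scoring string should match whatever metric the competition actually uses.

      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.model_selection import cross_val_score

      # Placeholder data standing in for a competition's training set
      X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

      model = GradientBoostingClassifier(random_state=0)

      # 5-fold local CV; choose a scoring string matching the competition metric
      scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
      print("Local CV: %.4f +/- %.4f" % (scores.mean(), scores.std()))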

    So let's get started!

    P.S.

    [Read More...]
  • Using Azure ML to Build Clickthrough Prediction Models

    Using Azure ML to Build Clickthrough Prediction Models
     

    This blog post is by Girish Nathan, a Senior Data Scientist at Microsoft.

    Ad click prediction is a multi-billion dollar industry, and one that is still growing rapidly. In this post, we build ML models on the largest publicly available ad click prediction dataset, from Criteo. The Criteo dataset consists of some 4.4 billion advertising feedback events. In Criteo’s words, “…this dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction.”

    Azure services provide the tools needed to build a predictive model using this data.

    [Read More...]
  • Data Transformation methods: one hot encoding, learning with counts

    Data Transformation

    One hot encoding transforms categorical features to a format that works better with classification and regression algorithms.

    Let’s take the following example. I have seven sample inputs of categorical data belonging to four categories. I could encode these as integer values, but that wouldn’t make sense from a machine learning perspective: we can’t say that the category “Penguin” is greater or smaller than “Human”. That would make them ordinal values, when they are in fact nominal.

    What we do instead is generate one boolean column for each category. Exactly one of these columns takes the value 1 for each sample.
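
    As a minimal sketch of that transformation with pandas.get_dummies: the "Penguin" and "Human" labels come from the example above, while the other two category names and the sample order are made up for illustration.

      import pandas as pd

      # Seven samples from four categories ("Octopus" and "Alien" are invented
      # placeholder labels; "Human" and "Penguin" come from the example above)
      samples = pd.DataFrame({
          "Category": ["Human", "Human", "Penguin", "Octopus",
                       "Alien", "Penguin", "Human"]
      })

      # One indicator column per category; exactly one is set for each sample
      encoded = pd.get_dummies(samples, columns=["Category"])
      print(encoded)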

    [Read More...]
  • Lessons Learned from Competing on Kaggle with Python

    from: http://www.jianshu.com/p/32def2294ae6

    I recently carved out some time to try a few projects on Kaggle with Python. Here are my notes and takeaways.

    Step 1: Exploratory Data Analysis

    EDA means exploring and analyzing the data; pandas and matplotlib are generally all you need. EDA usually covers:

    1. The meaning and type of each feature; some useful code is sketched after this list
    2. Whether there are any missing values
    3. The distribution of the data under each feature, which you can inspect with a boxplot or a histogram
    4. If you want to see how several features interact, you can use pandas' groupby or crosstab, for example:
      temp = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
      temp.plot(kind='bar', stacked=True, color=['red', 'blue'], grid=False)
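
    Tying the points above together, a minimal EDA sketch might look like the following; the Titanic train.csv and its column names are assumed purely for illustration.

      import pandas as pd
      import matplotlib.pyplot as plt

      # Load a training set (Titanic used as the running example; path assumed)
      df = pd.read_csv("train.csv")

      df.info()                     # feature names, types, and non-null counts
      print(df.isnull().sum())      # number of missing values per feature
      print(df.describe())          # basic distribution statistics

      # Distribution of a single numeric feature
      df["Age"].hist(bins=30)
      plt.show()

      # Boxplot of a numeric feature split by a categorical one
      df.boxplot(column="Age", by="Pclass")
      plt.show()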

    After this step, you should have a rough grasp of the following:

    • The meaning of each feature
    • Which features are useful, which of those can be used directly, and which need to be transformed before use, in preparation for the feature engineering to come

    Step 2: Data Preprocessing

    Data preprocessing means getting the data into shape as model input. It includes:

    • Handling missing values: there is real depth to this topic, and if you have good approaches I'd be glad to hear them. With my limited experience, I usually handle it case by case (a minimal imputation sketch follows this list):
      1. If missing values make up a very small share of the data, simply fill in the mean or the mode
      2. If the share of missing values is neither small nor large, consider the feature's relationship to other features. If the relationship is clear, fill the values in directly from the other features; you can also fit a simple model, such as linear regression or a random forest, to predict them.
      3. If the share of missing values is large, treat "missing" as its own special case and fill in a separate value
    • Handling outliers: this is where the earlier EDA pays off; use the plots to spot anomalous values
    • Handling categorical features: usually done with dummy variables, also called one-hot encoding, via pandas.get_dummies() or sklearn's preprocessing.OneHotEncoder(). I personally prefer pandas' get_dummies().
      Here is an example:

      (figure: dummy variable)

      A single month column is expanded into 12 columns, with 0 and 1 indicating the category.
      Two more points are worth noting when handling categorical features:

      1. If many features need dummy-variable treatment, you can easily end up with a sparse dataframe, in which case it is a good idea to reduce the dimensionality with PCA.
      2. If a feature has tens of thousands of distinct values, dummy variables are no longer practical; in that case you can use count-based learning.
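
    As referenced in the missing-value item above, here is a minimal imputation sketch; the Titanic-style file name and columns are placeholders for illustration.

      import pandas as pd
      from sklearn.ensemble import RandomForestRegressor

      df = pd.read_csv("train.csv")  # placeholder; Titanic-style columns assumed

      # Case 1: very few missing values -> fill with the mode (or mean for numerics)
      df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

      # Case 3: mostly missing -> treat "missing" as its own special value
      df["Cabin"] = df["Cabin"].fillna("Unknown")

      # Case 2: a moderate share missing -> predict from related features
      feats = ["Pclass", "SibSp", "Parch", "Fare"]
      known = df[df["Age"].notnull()]
      unknown = df[df["Age"].isnull()]
      model = RandomForestRegressor(n_estimators=100, random_state=0)
      model.fit(known[feats].fillna(0), known["Age"])
      df.loc[df["Age"].isnull(), "Age"] = model.predict(unknown[feats].fillna(0))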
    [Read More...]
  • An Introduction to Stock Market Data Analysis with Python

    Here are some of the best articles on stock market data analysis using Python.
    An Introduction to Stock Market Data Analysis with Python (Part 1)

    from: https://ntguardian.wordpress.com/2016/09/19/introduction-stock-market-data-python-1/

    This post is the first in a two-part series on stock data analysis using Python, based on a lecture I gave on the subject for MATH 3900 (Data Science) at the University of Utah. In these posts, I will discuss basics such as obtaining the data from Yahoo! Finance using pandas, visualizing stock data, moving averages, developing a moving-average crossover strategy, backtesting, and benchmarking. The final post will include practice problems.
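
    As a taste of the moving-average material covered in the series, here is a minimal pandas sketch; it assumes daily price data has already been saved to a CSV with Date and Close columns (the ticker and file name are placeholders).

      import pandas as pd

      # Daily prices assumed to be already downloaded (e.g. from Yahoo! Finance)
      stock = pd.read_csv("AAPL.csv", index_col="Date", parse_dates=True)

      # 20-day and 50-day simple moving averages
      stock["SMA20"] = stock["Close"].rolling(window=20).mean()
      stock["SMA50"] = stock["Close"].rolling(window=50).mean()

      # Crossover signal: +1 while the fast average is above the slow one, else -1
      stock["Signal"] = (stock["SMA20"] > stock["SMA50"]).astype(int) * 2 - 1

      # A trade is triggered whenever the signal changes sign
      stock["Trade"] = stock["Signal"].diff()
      print(stock[["Close", "SMA20", "SMA50", "Signal", "Trade"]].tail())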

    [Read More...]
  • Good Articles to learn how to implement a neural network 1

    This series of posts lists some good articles about how to implement a neural network. Thanks to the authors for their excellent work.
    If you are an author and do not want your article listed here, please email learn4master and we will remove it from the site.
     

    How to implement a neural network Part 1

     From: http://peterroelants.github.io/posts/neural_network_implementation_part01/

    This page is part of a 5-part (plus 2) tutorial on how to implement a simple neural network model. You can find the links to the rest of the tutorial here:

     

    The tutorials are generated from Python 2 IPython Notebook files,

    [Read More...]
Page 1 of 4