• Good resources to learn how to use websocket push api in python

    How to connect to poloniex.com websocket api using a python library

    The problem:

    I am trying to connect to wss://api.poloniex.com and subscribe to the ticker. I can’t find any working example in Python. I have tried autobahn/twisted and websocket-client 0.32.0.
    The goal is to get real-time ticker data and store it in a MySQL database.

    The solution:

    What you are trying to accomplish can be done with WAMP, specifically with the WAMP modules of the autobahn library (which you are already trying to use).

    [Read More...]
  • Cool Python tricks and tips

    Here are some cool tricks for writing better Python code:

    List comprehensions:

    Instead of building a list with a loop:

    We can often build it much more concisely with a list comprehension:
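
    The original listings are not reproduced in this excerpt. As a minimal sketch of the comparison, assuming a hypothetical task (squaring the even numbers below 10), the loop and the list comprehension look like this:

```python
# Hypothetical example data: the integers 0 to 9.
numbers = range(10)

# Loop version: build the list step by step.
squares_loop = []
for n in numbers:
    if n % 2 == 0:
        squares_loop.append(n * n)

# List comprehension version: the same filter and transform in one expression.
squares_comp = [n * n for n in numbers if n % 2 == 0]

print(squares_loop)
print(squares_comp)
```

    Both versions produce the same list; the comprehension simply folds the loop, the condition, and the append into one line.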

     

    Enumerate

    We can use enumerate to get both the index and the item in a for loop:

    enumerate also takes an optional second argument, start, which sets the index of the first item.
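
    A short sketch of both uses of enumerate, with a hypothetical list of fruit names standing in for the missing listing:

```python
fruits = ["apple", "banana", "cherry"]  # hypothetical data

# enumerate yields (index, item) pairs, so no manual counter is needed.
for index, fruit in enumerate(fruits):
    print(index, fruit)

# The optional second argument sets the starting index (here 1 instead of 0).
numbered = list(enumerate(fruits, 1))
print(numbered)
```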

     

    Dict/Set comprehensions

    Dict and set comprehensions are just as simple to use and just as effective:
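
    As an illustration (the word list is a hypothetical stand-in for the missing listing), a dict comprehension and a set comprehension might look like this:

```python
words = ["spark", "queue", "tor", "queue"]  # hypothetical data

# Dict comprehension: map each word to its length (duplicate keys collapse).
lengths = {word: len(word) for word in words}

# Set comprehension: collect the distinct word lengths.
unique_lengths = {len(word) for word in words}

print(lengths)
print(unique_lengths)
```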

     

    [Read More...]
  • install and use tor on ubuntu for python requests

    Install Tor on Ubuntu:

    sudo apt-get install tor

    Open /etc/tor/torrc in an editor (for example, vi /etc/tor/torrc) and set up the ports:

    Then start tor:

    service tor start

    Check tor service

    Install stem and PySocks

    pip install PySocks
    pip install stem

    The following code uses Tor as a proxy in Python.
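
    The post's own listing is not included in this excerpt. The sketch below is one possible approach, not necessarily the author's: it assumes Tor is listening on its default SOCKS port (9050) on localhost and that requests and PySocks are installed. The helper names tor_proxies and fetch_via_tor are hypothetical.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a requests-style proxies mapping for a local Tor SOCKS port.

    Assumption: 9050 is Tor's default SocksPort; adjust it to match torrc.
    The socks5h scheme asks the proxy to resolve host names as well.
    """
    url = "socks5h://{}:{}".format(host, port)
    return {"http": url, "https": url}


def fetch_via_tor(url, timeout=30):
    """Fetch a URL through Tor. Requires requests and PySocks."""
    import requests  # imported lazily so tor_proxies works without it

    response = requests.get(url, proxies=tor_proxies(), timeout=timeout)
    response.raise_for_status()
    return response.text


# Example call (needs a running Tor service and network access):
# print(fetch_via_tor("https://check.torproject.org/"))
```

    Routing DNS through the proxy (socks5h rather than socks5) is a design choice here, so that host name lookups do not bypass Tor.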

    [Read More...]
  • Dealing with encoding problems in Python requests

    This prevents encoding problems when using requests to crawl a website:
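
    The original snippet is missing from this excerpt, so what follows is only a stand-in. With requests, a common approach is to set response.encoding = response.apparent_encoding before reading response.text; whether the post used that is an assumption. The same idea can be sketched with the standard library alone, using a hypothetical helper decode_body that tries the declared charset first and then falls back:

```python
def decode_body(raw, declared=None):
    """Decode response bytes, preferring the declared charset.

    Falls back to UTF-8, then to Latin-1, which accepts every byte value.
    """
    for encoding in (declared, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next candidate
    return raw.decode("latin-1")


# Hypothetical page body: UTF-8 bytes that the server mislabelled as ASCII.
body = "café".encode("utf-8")
text = decode_body(body, declared="ascii")
print(text)
```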

     

    [Read More...]
  • Good resources for html extraction using python

    Here are some good python tools to extract texts from html pages:

    1. http://www.clips.ua.ac.be/pages/patter

    Pattern

    Pattern is a web mining module for the Python programming language.

    It has tools for data mining (Google, Twitter and Wikipedia API, a web crawler, a HTML DOM parser), natural language processing (part-of-speech taggers, n-gram search, sentiment analysis, WordNet), machine learning (vector space model, clustering, SVM), network analysis and <canvas> visualization.

    pattern.web

    The pattern.web module has tools for online data mining: asynchronous requests, a uniform API for web services (Google, Bing, Twitter, Facebook, Wikipedia,

    [Read More...]
  • python multi thread example

    In Python, it is easy to start multiple threads using the Thread class in the threading module. The threading module builds on the low-level features of thread to make it easier to write multithreaded programs in Python. If you want to run multiple operations concurrently in Python, you need to master the Thread class.

    Thread Objects
    Create and start a Thread

    We can easily run several threads concurrently using the Thread class. The syntax to create and start a thread is as follows:
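
    The listing itself is not part of this excerpt. A minimal sketch, assuming a hypothetical job that just records which thread ran it (only the name my_function comes from the post):

```python
import threading

results = []
results_lock = threading.Lock()  # protects the shared results list


def my_function(job_id):
    """Hypothetical job: record which thread handled which job."""
    with results_lock:
        results.append((job_id, threading.current_thread().name))


# Create one Thread per job, start them all, then wait for them to finish.
threads = [
    threading.Thread(target=my_function, args=(i,), name="worker-{}".format(i))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```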

    The jobs are defined in my_function,

    [Read More...]
  • Python Queue examples

    In this post, I will discuss how to use the Python Queue module. This module implements queues for multithreaded programming. Specifically, a Queue object can be used to solve the multi-producer, multi-consumer problem, where messages must be exchanged safely between multiple threads. Because the locking semantics are already implemented in the Queue class, you don’t need to handle low-level lock and unlock operations, which can easily cause deadlocks.
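
    The multi-producer, multi-consumer pattern described above can be sketched as follows. This is an illustrative example rather than the post's own code: it uses the Python 3 module name queue (Queue in Python 2), and the counts, the sentinel value, and the function names are assumptions.

```python
import queue      # named Queue in Python 2
import threading

NUM_PRODUCERS = 2
NUM_CONSUMERS = 2
ITEMS_PER_PRODUCER = 5
SENTINEL = None   # tells a consumer to stop

tasks = queue.Queue()
consumed = queue.Queue()  # thread-safe place to collect what consumers saw


def producer(producer_id):
    for n in range(ITEMS_PER_PRODUCER):
        tasks.put((producer_id, n))


def consumer():
    while True:
        item = tasks.get()  # blocks until an item is available
        if item is SENTINEL:
            break
        consumed.put(item)


producers = [threading.Thread(target=producer, args=(i,)) for i in range(NUM_PRODUCERS)]
consumers = [threading.Thread(target=consumer) for _ in range(NUM_CONSUMERS)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    tasks.put(SENTINEL)  # one sentinel per consumer, after all real items
for t in consumers:
    t.join()

# All threads have finished, so qsize() is exact at this point.
received = sorted(consumed.get() for _ in range(consumed.qsize()))
print(received)
```

    No explicit lock appears anywhere: put and get do the synchronization, which is the point the paragraph above makes.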

    Note

    Tips: queue is one of the most widely used data structures in computer science.

    [Read More...]
  • run pyspark on oozie

    In this post, I first give a working example of running PySpark on Oozie. Then I show how to run PySpark on Oozie using your own Python installation (e.g., Anaconda). In this way, you can use NumPy, pandas, and other Python libraries in your PySpark program.

    The syntax for creating a Spark action in an Oozie workflow

    As described in the documentation, here are the meanings of these elements.

    The prepare element, if present, indicates a list of paths to delete or create before starting the job. Specified paths must start with hdfs://HOST:PORT.

    [Read More...]
  • pyspark unit test based on python unittest library

    pyspark unit test

    PySpark is a powerful framework for large-scale data analysis. Because of its easy-to-use API, you can easily develop PySpark programs if you are familiar with Python programming.

    One problem is that unit testing PySpark code is a little hard. After searching Google for “pyspark unit test”, I only found articles about using py.test or other complicated libraries. However, I don’t want to install any third-party libraries. What I want is to set up the PySpark unit test environment based only on the unittest library,

    [Read More...]
  • python UnicodeEncodeError, converting unicode to ascii

    In Python, we often run into Unicode conversion errors. For instance, when you try to print a unicode string, you can get the following exception:

    The reason is that the str() function tries to convert the unicode string using ascii, which doesn’t support the character u'\xe6'.

    The solution is to encode the string as ‘utf-8’.

    Recommended conversion workflow: input (any code page) -> decode to unicode -> (process) -> encode output as utf-8

    See the following two examples:

    Best practice:

    Always encode from unicode to bytes.
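
    The post describes Python 2, where str() implicitly uses ascii. As a hedged sketch of the same workflow in Python 3 (where str is already Unicode text and bytes holds encoded data), assuming hypothetical cp1252 input:

```python
raw = b"\xe6ble"               # hypothetical cp1252 input bytes ("æble")

text = raw.decode("cp1252")    # input -> unicode
text = text.upper()            # process as unicode text
output = text.encode("utf-8")  # unicode -> utf-8 bytes for output

# Encoding the same text as ASCII fails, because ASCII has no "Æ".
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(text, output, ascii_ok)
```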

    [Read More...]