Python multithreading example


In Python, it is easy to start multiple threads using the Thread class in the threading module. The threading module builds on the low-level thread primitives to make it easier to write multithreaded programs in Python. If you want to run multiple operations concurrently in Python, you need to master the Thread class.

Thread Objects

Create and start a Thread

We can easily make several threads run concurrently using the Thread class. The syntax to create and start a thread is as follows:

The jobs, also called workers, are defined in my_function and start running after the start() method is called.
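The original snippet is not reproduced in this copy of the post; a minimal sketch of that syntax (my_function is the hypothetical worker named above) might look like this:

```python
import threading

def my_function():
    # The job (the "worker" code) that each thread performs goes here.
    print("hello from", threading.current_thread().name)

t = threading.Thread(target=my_function, name="t1")
t.start()   # runs my_function in a new thread
t.join()    # wait for the thread to finish
```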
 
In the following example, we create three threads and two workers. Since the target for thread1 is the function worker1, worker1 will be executed once thread1 is started.
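The listing itself is missing from this copy of the post; a reconstruction consistent with the output below might look like the following (the 5- and 10-second sleeps are inferred from the timestamps, so treat them as assumptions):

```python
import threading
import time

def now():
    return time.strftime("%H:%M:%S")

def worker1():
    print(f"Thread: {threading.current_thread().name} enter into worker1 at time = {now()}")
    time.sleep(5)   # simulate 5 seconds of work
    print(f"Thread: {threading.current_thread().name} exit worker1 at time = {now()}")

def worker2():
    print(f"Thread: {threading.current_thread().name} enter into worker2 at time = {now()}")
    time.sleep(10)  # simulate 10 seconds of work
    print(f"Thread: {threading.current_thread().name} exit worker2 at time = {now()}")

print(f"main: start to create three threads at: {now()}")
thread1 = threading.Thread(target=worker1, name="t1")
thread2 = threading.Thread(target=worker2, name="t2")
thread3 = threading.Thread(target=worker2, name="t3")
for t in (thread1, thread2, thread3):
    t.start()
print(f"Main: after submit three threads the time is: {now()}")
print(f"The main starts to wait for the threads at {now()}")
for t in (thread1, thread2, thread3):
    t.join()
```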

The following is the output of the program.  We can see that after starting the three threads, the main program was not blocked. 

main: start to create three threads at: 18:39:06
Thread: t1 enter into worker1 at time = 18:39:06
Thread: t2 enter into worker2 at time = 18:39:06
Thread: t3 enter into worker2 at time = 18:39:06
Main: after submit three threads the time is: 18:39:06
The main starts to wait for the threads at 18:39:06
Thread: t1 exit worker1 at time = 18:39:11
Thread: t2 exit worker2 at time = 18:39:16
Thread: t3 exit worker2 at time = 18:39:16

Get the name of the Thread

As you can see from the above example, we can use threading.current_thread().name to get the name of the current thread (the older threading.currentThread().getName() spelling still works, but it has been deprecated since Python 3.10). This makes it easy for you to manage your threads.
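A one-line illustration of this (the thread name "worker-1" below is just an example):

```python
import threading

def report():
    # current_thread().name returns the name given when the Thread was created
    print("running in", threading.current_thread().name)

t = threading.Thread(target=report, name="worker-1")
t.start()
t.join()
```

Inside the main thread, the same call returns "MainThread".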

Use multithreading to download URLs

The above program can be extended to solve more complicated tasks. For example, suppose we have a set of URLs, and we want to download them and store the results in a database. We can use multithreaded programming to make the whole process faster. The following program simulates the process of downloading a set of URLs with multiple threads. In order to pass arguments to the target function, we use args=(url,). Please note that the trailing comma is necessary, because args is expected to be a tuple.
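The original listing is not included in this copy of the post; a sketch consistent with the output that follows, assuming a download_url helper and a five-second simulated transfer (both inferred from the timestamps):

```python
import threading
import time

def download_url(url):
    # Simulate downloading one URL; a real crawler would fetch and store it here.
    name = threading.current_thread().name
    print(f"Thread: {name} start download {url} at time = {time.strftime('%H:%M:%S')}")
    time.sleep(5)  # simulate the network transfer
    print(f"Thread: {name} finish download {url} at time = {time.strftime('%H:%M:%S')}")

urls = [f"URL-{i}" for i in range(20)]
threads = []
for url in urls:
    # The trailing comma in args=(url,) makes it a one-element tuple.
    t = threading.Thread(target=download_url, args=(url,), name=f"thread-for-{url}")
    t.start()
    threads.append(t)
for t in threads:
    t.join()
```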

The following is the output. 

Thread: thread-for-URL-0 start download URL-0 at time = 19:14:08
Thread: thread-for-URL-1 start download URL-1 at time = 19:14:08
Thread: thread-for-URL-2 start download URL-2 at time = 19:14:08
Thread: thread-for-URL-3 start download URL-3 at time = 19:14:08
Thread: thread-for-URL-4 start download URL-4 at time = 19:14:08
Thread: thread-for-URL-5 start download URL-5 at time = 19:14:08
Thread: thread-for-URL-6 start download URL-6 at time = 19:14:08
Thread: thread-for-URL-7 start download URL-7 at time = 19:14:08
Thread: thread-for-URL-8 start download URL-8 at time = 19:14:08
Thread: thread-for-URL-9 start download URL-9 at time = 19:14:08
Thread: thread-for-URL-10 start download URL-10 at time = 19:14:08
Thread: thread-for-URL-11 start download URL-11 at time = 19:14:08
Thread: thread-for-URL-12 start download URL-12 at time = 19:14:08
Thread: thread-for-URL-13 start download URL-13 at time = 19:14:08
Thread: thread-for-URL-14 start download URL-14 at time = 19:14:08
Thread: thread-for-URL-15 start download URL-15 at time = 19:14:08
Thread: thread-for-URL-16 start download URL-16 at time = 19:14:08
Thread: thread-for-URL-17 start download URL-17 at time = 19:14:08
Thread: thread-for-URL-18 start download URL-18 at time = 19:14:08
Thread: thread-for-URL-19 start download URL-19 at time = 19:14:08
Thread: thread-for-URL-0 finish download URL-0 at time = 19:14:13
Thread: thread-for-URL-5 finish download URL-5 at time = 19:14:13
Thread: thread-for-URL-8 finish download URL-8 at time = 19:14:13
Thread: thread-for-URL-9 finish download URL-9 at time = 19:14:13
Thread: thread-for-URL-3 finish download URL-3 at time = 19:14:13
Thread: thread-for-URL-4 finish download URL-4 at time = 19:14:13
Thread: thread-for-URL-2 finish download URL-2 at time = 19:14:13
Thread: thread-for-URL-1 finish download URL-1 at time = 19:14:13
Thread: thread-for-URL-6 finish download URL-6 at time = 19:14:13
Thread: thread-for-URL-7 finish download URL-7 at time = 19:14:13
Thread: thread-for-URL-10 finish download URL-10 at time = 19:14:13
Thread: thread-for-URL-12 finish download URL-12 at time = 19:14:13
Thread: thread-for-URL-17 finish download URL-17 at time = 19:14:13
Thread: thread-for-URL-14 finish download URL-14 at time = 19:14:13
Thread: thread-for-URL-13 finish download URL-13 at time = 19:14:13
Thread: thread-for-URL-19 finish download URL-19 at time = 19:14:13
Thread: thread-for-URL-16 finish download URL-16 at time = 19:14:13
Thread: thread-for-URL-15 finish download URL-15 at time = 19:14:13
Thread: thread-for-URL-18 finish download URL-18 at time = 19:14:13
Thread: thread-for-URL-11 finish download URL-11 at time = 19:14:13

Multiple producer-consumer problem

One problem with the above program is that we create as many threads as there are URLs. This does not scale when there are many URLs. A better approach is to use the Queue module to build a thread pool, so that a limited number of threads handle all the downloads.

When downloading URLs, there are actually two kinds of tasks: 1. producing URLs; 2. getting URLs and downloading them. This is a kind of multi-producer, multi-consumer problem. The basic characteristics of the producer-consumer problem are:

The producer keeps generating new tasks (i.e., URLs) and stores them in a bounded queue. If the queue is full, the producer has to wait.

The consumer (the crawler here) keeps taking tasks (URLs) from the queue and consuming (downloading) them. If the queue is empty, the consumer has to wait.

There are many ways to solve the producer-consumer problem; the easiest in Python is based on the Queue class. One advantage of using the Python Queue class is that we don't need to deal with the low-level synchronization issues of multithreaded programming ourselves. All we need to do is create a Queue object with a maximum size. Producer threads add new items by calling the put() method; if the queue is full, put() blocks until a consumer frees a slot by calling get(). Consumer threads keep taking items with the get() method; if the queue is empty, get() blocks until a producer adds an item. Once an item has been processed, the consumer calls the task_done() method, which lets Queue.join() know when all items have been handled.

In the following example, we use Queue to implement the producer-consumer pattern. If you are not familiar with it, please refer to this post for examples of how to use the Python Queue class.
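The original listing is missing from this copy; a sketch of the pattern consistent with the log format below (the producer names, domain labels, queue size, and sleep durations are assumptions, and the sentinel-based shutdown is one common way to stop the consumers):

```python
import queue
import threading
import time

url_queue = queue.Queue(maxsize=2)  # bounded queue: put() blocks when full

def produce(domain, count):
    name = threading.current_thread().name
    for i in range(count):
        url = f"{domain}-URL-{i}"
        print(f"Thread: {name} start put url {url} into url_queue"
              f"[current size={url_queue.qsize()}] at time = {time.strftime('%H:%M:%S')}")
        url_queue.put(url)  # blocks if the queue is full
        print(f"Thread: {name} finish put url {url} into url_queue"
              f"[current size={url_queue.qsize()}] at time = {time.strftime('%H:%M:%S')}")
        time.sleep(1)  # simulate the time needed to discover the next URL

def consume():
    name = threading.current_thread().name
    while True:
        url = url_queue.get()  # blocks if the queue is empty
        if url is None:        # sentinel: no more work, exit the loop
            url_queue.task_done()
            break
        print(f"Thread: {name} start download {url} at time = {time.strftime('%H:%M:%S')}")
        time.sleep(1)  # simulate the download
        print(f"Thread: {name} finish download {url} at time = {time.strftime('%H:%M:%S')}")
        url_queue.task_done()

producers = [threading.Thread(target=produce, args=(d, 3), name=f"url_producer-{i + 1}")
             for i, d in enumerate(["Domain-A", "Domain-B"])]
consumers = [threading.Thread(target=consume, name=f"Thread-{i}") for i in range(3)]
print(f"Main: start crawler threads at {time.strftime('%H:%M:%S')}")
for t in consumers:
    t.start()
print(f"Main: start producer threads at {time.strftime('%H:%M:%S')}")
for t in producers:
    t.start()
for t in producers:
    t.join()
for _ in consumers:
    url_queue.put(None)  # one sentinel per consumer so they all exit
for t in consumers:
    t.join()
```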

The output:

Main: start crawler threads at 22:30:45
Main: start producer threads at 22:30:45
Thread: url_producer-1 start put url Domain-B-URL-0 into url_queue[current size=0] at time = 22:30:50
Thread: url_producer-1 finish put url Domain-B-URL-0 into url_queue[current size=1] at time = 22:30:50
Thread: Thread-0 start download Domain-B-URL-0 at time = 22:30:50
Thread: Thread-0 finish download Domain-B-URL-0 at time = 22:30:52
Thread: url_producer-1 start put url Domain-A-URL-0 into url_queue[current size=0] at time = 22:30:53
Thread: url_producer-1 finish put url Domain-A-URL-0 into url_queue[current size=1] at time = 22:30:53
Thread: Thread-1 start download Domain-A-URL-0 at time = 22:30:53
Thread: url_producer-1 start put url Domain-B-URL-1 into url_queue[current size=0] at time = 22:30:56
Thread: url_producer-1 finish put url Domain-B-URL-1 into url_queue[current size=1] at time = 22:30:56
Thread: Thread-2 start download Domain-B-URL-1 at time = 22:30:56
Thread: url_producer-1 start put url Domain-A-URL-1 into url_queue[current size=0] at time = 22:30:58
Thread: Thread-1 finish download Domain-A-URL-0 at time = 22:30:58
Thread: url_producer-1 finish put url Domain-A-URL-1 into url_queue[current size=1] at time = 22:30:58
Thread: Thread-3 start download Domain-A-URL-1 at time = 22:30:58
Thread: Thread-3 finish download Domain-A-URL-1 at time = 22:31:00
Thread: Thread-2 finish download Domain-B-URL-1 at time = 22:31:02
Thread: url_producer-1 start put url Domain-B-URL-2 into url_queue[current size=0] at time = 22:31:02
Thread: url_producer-1 finish put url Domain-B-URL-2 into url_queue[current size=1] at time = 22:31:02
Thread: Thread-4 start download Domain-B-URL-2 at time = 22:31:02
Thread: url_producer-1 start put url Domain-A-URL-2 into url_queue[current size=0] at time = 22:31:04
Thread: url_producer-1 finish put url Domain-A-URL-2 into url_queue[current size=1] at time = 22:31:04
Thread: Thread-0 start download Domain-A-URL-2 at time = 22:31:04
Thread: Thread-4 finish download Domain-B-URL-2 at time = 22:31:05
Thread: Thread-0 finish download Domain-A-URL-2 at time = 22:31:06
Thread: url_producer-1 start put url Domain-B-URL-3 into url_queue[current size=0] at time = 22:31:12
Thread: url_producer-1 finish put url Domain-B-URL-3 into url_queue[current size=1] at time = 22:31:12