Using the command line interface of debug

Session 10: traverse - a deadlocked program

The program traverse takes directory names as arguments; for each directory, it creates a separate thread that reads the contents of that directory and writes the names to a shared list; it then prints the filenames when all the directories have been traversed.
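
The structure described above can be sketched as follows. This is not the source of traverse.c, which is not reproduced in this session; it is a minimal illustration that assumes POSIX threads (pthread_create, pthread_join) in place of the UnixWare threads library, and the names traverse_all, names, MAX_NAMES, and list_lock are hypothetical.

```c
#include <assert.h>
#include <dirent.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_NAMES 4096              /* hypothetical fixed capacity for this sketch */

static char *names[MAX_NAMES];      /* shared list of filenames */
static int name_count = 0;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append a copy of name to the shared list, holding the lock only for the update. */
static void add_to_list(const char *name)
{
    size_t len = strlen(name) + 1;
    char *copy = malloc(len);

    if (copy == NULL)
        return;
    memcpy(copy, name, len);

    pthread_mutex_lock(&list_lock);
    if (name_count < MAX_NAMES) {
        names[name_count++] = copy;
        copy = NULL;                /* ownership moved to the list */
    }
    pthread_mutex_unlock(&list_lock);
    free(copy);                     /* frees the copy only if the list was full */
}

/* Thread body: read one directory and record "dir/entry" for each entry. */
static void *traverse(void *arg)
{
    const char *dir = arg;
    DIR *dp = opendir(dir);
    struct dirent *ent;
    char path[1024];

    if (dp == NULL)
        return NULL;
    while ((ent = readdir(dp)) != NULL) {
        snprintf(path, sizeof path, "%s/%s", dir, ent->d_name);
        add_to_list(path);
    }
    closedir(dp);
    return NULL;
}

/* Start one thread per directory, wait for all of them, and return the
 * number of names collected (or -1 if the thread table cannot be allocated). */
int traverse_all(int ndirs, char **dirs)
{
    pthread_t *tids;
    int i, started = 0;

    assert(dirs != NULL);
    if (ndirs <= 0)
        return 0;
    tids = malloc((size_t)ndirs * sizeof *tids);
    if (tids == NULL)
        return -1;
    for (i = 0; i < ndirs; i++) {
        if (pthread_create(&tids[started], NULL, traverse, dirs[i]) == 0)
            started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    free(tids);
    return name_count;
}
```

A main function would pass argv + 1 and argc - 1 to traverse_all and then print names[0] through names[count - 1], which mirrors the "print when all directories have been traversed" step.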

Compiling the multithreaded program

To compile the program, you need to link with the threads library. Use the following cc options:

   $ cc -g -Kthread -o traverse traverse.c
This creates the executable traverse.

Testing traverse

Execute the program.

   $ traverse /var/tmp /var/cron
After waiting a while, notice that the program just sits there with no response. Kill the process and run it again, but this time, in the background.
   $ traverse /var/tmp /var/cron &
   [1]     17752

Grabbing the hung program

Invoke debug, and grab the process you just created.

   $ debug
   debug> grab 17752
   New program traverse (process p1) grabbed
   Created 3 thread(s) for process p1
   HALTED p1.1 [_thr_swtch]
           0xbff8c4f2 (_thr_swtch+466:)     addl     $0x14, %esp
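Notice that debug informs you that there are three threads in the process it grabbed. The line displayed is the current location of the current thread, p1.1; it is shown as a disassembled instruction because the thread is stopped inside the library function _thr_swtch.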

Examining the thread states

Enter the ps command and see what the current states of the threads are:

   debug> ps
   Program     ID      Pid Thread State      Function   Location       Command
   *traverse   p1.1   17752    1 Off LWP    _thr_swtch 0xbff8c4f2   traverse /var/tmp /var/cron
    traverse   p1.2   17752    2 Off LWP    _thr_swtch 0xbff8c4f2   traverse /var/tmp /var/cron
    traverse   p1.3   17752    3 Stopped    _block     0xbffd259a   traverse /var/tmp /var/cron
This tells you that traverse has three threads: the main thread p1.1 and two created threads, p1.2 and p1.3. This is expected, since you gave the executable two directory paths (/var/tmp and /var/cron) to traverse.

NOTE: The main thread (thread 1) of a grabbed process will not always be pn.1. The numbering is based on the order of the thread list provided to the debugger by the thread library.

The information you are interested in is that given by the ``State'' field. When a multiplexed thread waits for a resource (in this case, the list of filenames), it is taken off the LWP if that resource is not yet available. Note that two of the threads are taken off the LWP. The third thread is blocked, probably waiting for one of the other threads to continue.

NOTE: Multiplexed threads can run on any available LWP. When a multiplexed thread that is on an LWP waits for a resource, it may be taken off that LWP to make way for other threads. When the resource becomes available, the waiting thread may then be picked up by an available LWP for execution. A bound thread is always attached to one LWP and never leaves it.
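
The distinction in the note above can be illustrated in code. In the UnixWare threads library a bound thread is requested with the THR_BOUND flag to thr_create; the sketch below instead uses the closest POSIX analogue, the contention scope attribute, so that it can be built on other systems. Treat the mapping (PTHREAD_SCOPE_SYSTEM for bound, PTHREAD_SCOPE_PROCESS for multiplexed) as an approximation, and note that run_with_scope and worker are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body: record that the thread actually ran. */
static void *worker(void *arg)
{
    int *ran = arg;

    *ran = 1;
    return NULL;
}

/* Create a thread with the requested contention scope and wait for it.
 * Returns 0 on success, otherwise the error number of the failing call. */
int run_with_scope(int scope, int *ran)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc;

    assert(ran != NULL);
    *ran = 0;

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* PTHREAD_SCOPE_SYSTEM: scheduled system-wide, similar to a bound thread.
     * PTHREAD_SCOPE_PROCESS: scheduled within the process, similar to a
     * multiplexed thread. An implementation may support only one of them. */
    rc = pthread_attr_setscope(&attr, scope);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, worker, ran);
    if (rc == 0)
        rc = pthread_join(tid, NULL);

    pthread_attr_destroy(&attr);
    return rc;
}
```

Calling run_with_scope(PTHREAD_SCOPE_SYSTEM, &ran) asks for the bound-like behavior; if a system rejects the requested scope, the error is returned by pthread_attr_setscope and no thread is created.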

Viewing the stack traces of threads

Do a stack trace. To get the stack traces of all three threads, you need to specify the -p all option to the stack command:

   debug> stack -p all
   Stack Trace for p1.1, Program traverse
   *[0] _thr_swtch(presumed: 0x1, 0xbff90d30, 0)     [0xbff8c4f2]
    [1] _thr_cond_wait(presumed: 0xbff914a4, 0, 0xbff91a98) [0xbff872be]
    [2] thr_join(0, 0, 0x80477e4)   [0xbff83d17]
    [3] main(argc=3, argv=0x8047814, 0x8047824)      [traverse.c@162]
    [4] _start()     [0x80486f4]

   Stack Trace for p1.2, Program traverse
   *[0] _thr_swtch(0x1, 0xbff77cc4, 0x8049c74)       [0xbff8c4f2]
    [1] mutex_lock(0x8049c6c)    [0xbff8dc02]
    [2] add_to_list(name="/var/tmp/")        [traverse.c@61]
    [3] traverse(info=0x804a058, presumed: 0, 0)     [traverse.c@111]
    [4] _thr_start() [0xbff8b6d2]

   Stack Trace for p1.3, Program traverse
   *[0] _block(0)    [0xbffd259a]
    [1] __lwp_cond_wait(0xbff72f1c, 0xbff91ad0, 0)   [0xbffea5b9]
    [2] _lwp_cond_wait(presumed: 0xbff72f1c, 0xbff91ad0, 0xbff95b14) [0xbffea63a]
    [3] _thr_disp(0xbff72cc4)    [0xbff8bfe7]
    [4] _thr_swtch(0x1, 0xbff72cc4, 0x8049c74)       [0xbff8c3ee]
    [5] mutex_lock(0x8049c6c)    [0xbff8dc02]
    [6] add_to_list(name="/var/cron/log")    [traverse.c@61]
    [7] traverse(info=0x804a068, presumed: 0, 0)     [traverse.c@111]
    [8] _thr_start() [0xbff8b6d2]

Diagnosing the problem

Threads p1.2 and p1.3 execute the same instructions but on different directory paths. The stack traces reveal that they both have a call to mutex_lock. If you examine the source code, mutex_lock is called within the add_to_list function. Since the list is a shared resource, the intention is to lock this list resource to prevent other threads from simultaneously updating it. Once a thread is done with the resource, it has to unlock it so that other threads can access it.

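So why are p1.2 and p1.3 both waiting inside mutex_lock? Since p1.2 had first access to the list resource, it has to release the lock on the resource to allow p1.3 access to the list. Examining the source code further shows that there is no call to unlock the mutex. As a result, p1.3 can never acquire the lock, and p1.2 itself blocks the next time it calls add_to_list, because it tries to lock a mutex that it already holds. The main thread, p1.1, waits in thr_join for threads that will never finish.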
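
The missing unlock and its fix can be sketched as follows. The real traverse.c uses the UnixWare mutex_lock routine, and the corresponding fix there is a call to mutex_unlock before add_to_list returns; its source is not shown in this session, so everything below is an assumption made for illustration. The sketch uses POSIX mutexes, and it deliberately uses an error-checking mutex so that relocking by the owner is reported as EDEADLK instead of hanging the way traverse does. The names shared_list, list_init, add_to_list_buggy, and add_to_list_fixed are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* for PTHREAD_MUTEX_ERRORCHECK on strict compilers */
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define MAX_ITEMS 16        /* hypothetical capacity for this sketch */

struct shared_list {
    pthread_mutex_t lock;
    const char *items[MAX_ITEMS];
    int count;
};

/* Initialize the list with an error-checking mutex, so that a thread
 * relocking a mutex it already owns gets EDEADLK instead of blocking. */
int list_init(struct shared_list *l)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(l != NULL);
    l->count = 0;
    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&l->lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* The pattern diagnosed in this session: the lock is taken but never released. */
int add_to_list_buggy(struct shared_list *l, const char *name)
{
    int rc = pthread_mutex_lock(&l->lock);

    if (rc != 0)
        return rc;          /* the owner's second call ends up here */
    if (l->count < MAX_ITEMS)
        l->items[l->count++] = name;
    return 0;               /* BUG: no unlock before returning */
}

/* The fix: release the lock as soon as the update is done. */
int add_to_list_fixed(struct shared_list *l, const char *name)
{
    int rc = pthread_mutex_lock(&l->lock);

    if (rc != 0)
        return rc;
    if (l->count < MAX_ITEMS)
        l->items[l->count++] = name;
    return pthread_mutex_unlock(&l->lock);
}
```

With a default (non-checking) mutex the second call to the buggy version would simply block, which is the state in which debug found p1.2 and p1.3.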


© 2004 The SCO Group, Inc. All rights reserved.
UnixWare 7 Release 7.1.4 - 27 April 2004