How to Troubleshoot Deadlocks in Concurrent Programming with C++?

A step-by-step guide to using gdb to locate and debug deadlocks

Fracis
3 min read · Apr 11, 2024

Recently, while developing a parallel nested loop join operator within Apache Arrow, I ran into a deadlock, a common problem whenever parallelism is involved. So, what do you do if you encounter a deadlock? How do you locate and debug it?

That’s precisely the aim of this article — to assist you in quickly mastering deadlock detection and debugging in concurrent programming. Let’s delve into it.

1. Introduction

To better explain deadlocks, let’s introduce a program.

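#include <mutex>

std::mutex gMutex;

int t2() {
    // Tries to lock gMutex; hangs if the calling thread already holds it.
    std::lock_guard<std::mutex> m(gMutex);
    return 0;
}

int t1() {
    std::lock_guard<std::mutex> l(gMutex);
    return t2();  // gMutex is still held here
}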

Looking at this program, it’s evident to most that there’s a problem — a deadlock! The issue lies in both the t1() and t2() functions attempting to lock the global mutex gMutex. However, after t1() locks it, it calls t2(), which tries to lock gMutex again internally.

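t1 has locked the mutex and will not release it until it returns, but it cannot return until t2 finishes, and t2 is waiting for that same mutex on the same thread. std::mutex is not recursive: locking it again from the thread that already owns it is undefined behavior, and in practice the thread simply blocks forever. In real project code, the pattern is rarely as obvious as in this example; it's much more complex. For instance, in the code I wrote for Apache Arrow: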

Status OnBuildSideFinished(size_t thread_index) {
    std::lock_guard<std::mutex> guard(probe_side_mutex_);
    // do something
    accumulate_build_ready_ = true;
    return scheduler_->StartTaskGroup(thread_index, task_group_probe_,
                                      queued_batches_to_probe_.batch_count());
}

This code is much more intricate than the previous scenario. It’s nested several layers deep in the stack, and there’s more code not shown here. However, both scenarios fundamentally represent a deadlock model.
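To make the pattern concrete, here is a minimal sketch of one common way to avoid holding a lock while calling into code that may come back for the same lock. It is not the actual Arrow code or its fix: ProbeState, IsReady, RunDemo and the start_tasks callback are hypothetical stand-ins, and the sketch assumes the deadlock comes from the called code re-entering the object and locking the same mutex.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an operator that hands work to a scheduler.
class ProbeState {
public:
    explicit ProbeState(std::function<int(std::size_t)> start_tasks)
        : start_tasks_(std::move(start_tasks)) {}

    int OnBuildSideFinished(std::size_t thread_index) {
        {
            // Hold the mutex only while touching shared state.
            std::lock_guard<std::mutex> guard(mutex_);
            build_ready_ = true;
        }
        // The lock is released here, so the callback may call IsReady().
        return start_tasks_(thread_index);
    }

    bool IsReady() {
        std::lock_guard<std::mutex> guard(mutex_);
        return build_ready_;
    }

private:
    std::mutex mutex_;
    bool build_ready_ = false;
    std::function<int(std::size_t)> start_tasks_;
};

// Usage: the callback re-enters the object and takes the same mutex.
int RunDemo() {
    ProbeState* self = nullptr;
    ProbeState state([&self](std::size_t index) {
        return self->IsReady() ? static_cast<int>(index) : -1;
    });
    self = &state;
    return state.OnBuildSideFinished(3);
}
```

If the lock_guard in OnBuildSideFinished covered the whole function body, the call to IsReady() inside the callback would try to lock a mutex its own thread already holds, which is the same model as the t1/t2 example.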


2. Debugging

After understanding the deadlock model, how do you locate such issues?

Two methods can be employed here. The first is to run the program normally and then attach gdb to the running process.

For example:

./a.out

Then, after finding the process ID:

gdb -p xxx

At this point, we can determine which thread is currently waiting.

(gdb) info threads 
Id Target Id Frame
* 1 Thread 0x7ffff7fe2740 (LWP 32301) "a.out" 0x00007ffff7bc8017 in pthread_join () from /lib64/libpthread.so.0
2 Thread 0x7ffff6fd0700 (LWP 32305) "a.out" 0x00007ffff7bcd54d in __lll_lock_wait () from /lib64/libpthread.so.0

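Thread 2 is stuck in __lll_lock_wait, which means it is blocked waiting for a mutex. Switching to it with thread 2 and printing its stack with bt (or dumping every thread at once with thread apply all bt) shows the frames for t1 and t2. If the program was compiled with debug information (-g), the frames include line numbers, directly pinpointing where the issue arises. It's very intuitive!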

Apart from this method, you can also run the program directly under gdb. It will hang in the same way; pressing Ctrl+C interrupts it and returns control to gdb, where you can get the same information as above.

For example:

(gdb) r
Starting program: /home/light/a.out
Missing separate debuginfos, use: debuginfo-install glibc-2.17-326.el7_9.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Hello World!
[New Thread 0x7ffff6fd0700 (LWP 32305)]
^C
Thread 1 "a.out" received signal SIGINT, Interrupt.
0x00007ffff7bc8017 in pthread_join () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64
(gdb) info threads
Id Target Id Frame
* 1 Thread 0x7ffff7fe2740 (LWP 32301) "a.out" 0x00007ffff7bc8017 in pthread_join () from /lib64/libpthread.so.0
2 Thread 0x7ffff6fd0700 (LWP 32305) "a.out" 0x00007ffff7bcd54d in __lll_lock_wait () from /lib64/libpthread.so.0
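Once the stack has pointed at t1 and t2, one common way to remove this kind of double lock is to split the work from the locking. The sketch below is only one option, and t2_locked is a hypothetical helper name that does not appear in the original program.

```cpp
#include <mutex>

std::mutex gMutex;

// Hypothetical helper: does t2's work and assumes the caller already holds gMutex.
int t2_locked() {
    return 0;
}

// Entry point for callers that do not hold the lock.
int t2() {
    std::lock_guard<std::mutex> m(gMutex);
    return t2_locked();
}

int t1() {
    std::lock_guard<std::mutex> l(gMutex);
    // Call the non-locking helper so gMutex is acquired only once on this thread.
    return t2_locked();
}
```

Another option is to switch to std::recursive_mutex, which allows the owning thread to lock again, although that tends to hide the fact that the locking structure is unclear.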

In conclusion, debugging deadlocks in concurrent programming requires a systematic approach: whether you attach gdb to a hung process or run the program under gdb from the start, list the threads, find the ones blocked in a lock wait, and read their stacks.


Written by Fracis

Stories about C++. Arrow and DuckDB contributor.
