Step-by-Step Analysis of a Crash Caused by Resize from Zero

Analyzing a crash in Apache Arrow

Fracis
3 min read · Mar 27, 2024

This article walks through debugging a crash inside STL source code from scratch. Using a real-world crash caused by a resize call that came up at work, we'll go through a systematic analysis to help you understand the process.

My PR: GH-40816: [C++] Security checks and relaxing hashjoin batch rows size.

Background

Recently, while attempting to increase the size of an Arrow batch to 65536, I ran into a crash with the following error:

terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append

Catching the exception in GDB pinpointed the faulting location. The stack trace showed that the problem arose while building the hash table's partition array during the join:

prtn_state.key_ids.resize(num_rows_before + num_rows_new);

The question thus became: why does this resize call throw an exception?

Inspecting the STL code revealed two scenarios. Here is a simplified version of the relevant STL code for reference:

// Simplified from libstdc++: this is the path taken by a growing vector::resize().
if (__navail < __n) {
  // Not enough spare capacity; computing the new length is where it can throw.
  const size_type __len =
      _M_check_len(__n, "vector::_M_default_append");
  // ... reallocate to __len and default-construct the new elements ...
}

size_type _M_check_len(size_type __n, const char* __s) const {
  // If appending __n more elements would exceed max_size(), throw std::length_error.
  if (max_size() - size() < __n)
    __throw_length_error(__N(__s));
  // ... otherwise compute and return the new capacity ...
}

The crucial function here is _M_check_len. Can you think of the two scenarios that could trigger this check?

  • Scenario 1: Memory is genuinely insufficient: the requested size exceeds the vector's max_size(), so the exception is thrown.
  • Scenario 2: A negative number is passed as __n; because __n has type size_t, it wraps around to an extremely large value, which then fails the check (a minimal reproduction of this case follows below).
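
Scenario 2 is easy to reproduce in isolation. The snippet below is a minimal sketch, not Arrow code; the value -61434 is borrowed from the log shown further down, and the exact what() text is specific to libstdc++:

#include <cstdio>
#include <stdexcept>
#include <vector>

int main() {
  std::vector<int> v;
  int num_rows_new = -61434;  // a negative row count, as in the bug
  try {
    // The implicit conversion to size_t turns -61434 into a value close to 2^64,
    // far beyond vector::max_size(), so resize() throws std::length_error.
    v.resize(v.size() + num_rows_new);
  } catch (const std::length_error& e) {
    std::printf("%s\n", e.what());  // prints "vector::_M_default_append" with libstdc++
  }
  return 0;
}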

Checking memory ruled out Scenario 1 on our system, so attention shifted to Scenario 2. The initial suspicion was that __n was effectively negative, and adding logging confirmed it: rows_new did indeed become negative:

part id 15, dop_ = 105, prtnid + 1 ranges = 0, prtnid ranges = 61434, part size:0, rows_new: -61434, cap: 0

Now that the cause was identified, the next step was to analyze why a negative number was generated.

num_rows_new is computed from the partition ranges, and the following expression is what produced the negative value:

int num_rows_new =
    locals.batch_prtn_ranges[prtn_id + 1] - locals.batch_prtn_ranges[prtn_id];
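
Plugging the values from the log into this expression shows how the negative number arises. The sketch below is illustrative only: the local array stands in for locals.batch_prtn_ranges, and it assumes the range entries are 16-bit unsigned values, as the prefix-sum code shown later suggests:

#include <cstdint>
#include <cstdio>

int main() {
  // Stand-in for locals.batch_prtn_ranges, using the logged values:
  // ranges[prtn_id] = 61434, ranges[prtn_id + 1] = 0.
  uint16_t ranges[] = {61434, 0};
  // Both uint16_t operands are promoted to int, so the subtraction is signed
  // and yields -61434 instead of a wrapped unsigned value.
  int num_rows_new = ranges[1] - ranges[0];
  std::printf("num_rows_new = %d\n", num_rows_new);  // prints -61434
  return 0;
}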

Further investigation led to the Eval function of PartitionSort, where a couple of critical points stood out:

ARROW_DCHECK(num_rows > 0 && num_rows <= (1 << 15));

The first was this assertion. Even though the 65536 being passed is clearly greater than the 32768 (1 << 15) allowed here, the assertion never fired. It turned out this was a release build, in which the check only produces a warning rather than being fatal.
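
For context on why the check can stay silent, the sketch below shows a typical debug-only check macro. It is a generic illustration under my own naming (MY_DCHECK), not Arrow's actual ARROW_DCHECK definition, but it captures why a violated condition goes unnoticed in a release build:

#include <cstdio>
#include <cstdlib>

#ifndef NDEBUG
  // Debug build: a failed check aborts the program.
  #define MY_DCHECK(cond) \
    do { if (!(cond)) { std::fprintf(stderr, "check failed: %s\n", #cond); std::abort(); } } while (0)
#else
  // Release build: at most a warning is printed and execution continues.
  #define MY_DCHECK(cond) \
    do { if (!(cond)) std::fprintf(stderr, "warning: check failed: %s\n", #cond); } while (0)
#endif

int main() {
  int num_rows = 65536;
  MY_DCHECK(num_rows > 0 && num_rows <= (1 << 15));  // only fatal in a debug build
  return 0;
}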

Continuing, the use of uint16_t for the running sum stood out (see the prefix-sum loop after this list). For num_rows_new to become negative, only two possibilities existed:

  • Scenario 1: locals.batch_prtn_ranges[prtn_id + 1] is smaller than locals.batch_prtn_ranges[prtn_id].
  • Scenario 2: locals.batch_prtn_ranges[prtn_id + 1] is itself negative, and locals.batch_prtn_ranges[prtn_id], whether negative or positive, is greater than it.

uint16_t sum = 0;
for (int i = 0; i < num_prtns; ++i) {
  // Convert per-partition row counts into cumulative offsets;
  // note that the running sum is only 16 bits wide.
  uint16_t sum_next = sum + prtn_ranges[i + 1];
  prtn_ranges[i + 1] = sum;
  sum = sum_next;
}

At first glance this code rules out Scenario 1: it computes a running prefix sum, so each entry should be at least as large as the previous one, with equality as the worst case. And since the entries come out of an unsigned 16-bit sum, Scenario 2 (a genuinely negative entry) cannot happen either. The missing piece is overflow: the sum is a uint16_t, whose maximum is 65535, while the batch holds 65536 rows. When the running sum reaches 65536 it wraps around to 0, so a later boundary (0) ends up smaller than an earlier one (61434), exactly as the log shows. That is Scenario 1 after all, produced by the wraparound, and the subtraction then yields the negative num_rows_new that crashed the resize.
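
To make the wraparound concrete, here is a small standalone sketch. The per-partition row counts are made-up stand-ins whose total is 65536:

#include <cstdint>
#include <cstdio>

int main() {
  // Hypothetical per-partition row counts that add up to 65536.
  uint16_t counts[] = {61434, 4102};
  uint16_t sum = 0;
  for (uint16_t c : counts) {
    uint16_t sum_next = sum + c;  // 61434 + 4102 = 65536, which wraps to 0 in 16 bits
    std::printf("boundary %d -> %d\n", sum, sum_next);
    sum = sum_next;
  }
  // The final boundary, which should be 65536, is now 0: smaller than the
  // previous boundary (61434), so a later "next - current" subtraction goes negative.
  return 0;
}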

With this, the current round of debugging and analysis comes to a close.

Fracis

Stories about C++; Arrow and DuckDB contributor