Why is Node.js Crashing? A Deep Dive into a Tricky Bug

来源:互联网 时间:1970-01-01

In early September, we started seeing one of our backend services exit with the following error:

2015-09-10_23:28:11.75413 node: ../src/node_buffer.cc:226: v8::MaybeLocal<v8::Object> node::Buffer::New(v8::Isolate*, v8::Local<v8::String>, node::encoding): Assertion `(data) != (nullptr)' failed.

Typically, Node.js exits with a JavaScript stack trace. This error was at a lower level. Node.js was dying because of an assertion in its Buffercode. At worst, creating a new Buffer should throw an exception. This was a crash. This was a Node.js bug.

The first thing I did was look at the code containing the assertion error. This was in node_buffer.ccon line 225 . I’ve included the entire Buffer::New()constructor below:

209 MaybeLocal<Object> New(Isolate isolate,210 Local<String> string,211 enum encoding enc) {212 EscapableHandleScope scope(isolate);213214 size_t length = StringBytes::Size(isolate, string, enc);215 char data = static_cast<char>(malloc(length));216217 if (data == nullptr)218 return Local<Object>();219220 size_t actual = StringBytes::Write(isolate, data, length, string, enc);221 CHECK(actual <= length);222223 if (actual < length) {224 data = static_cast<char>(realloc(data, actual));225 CHECK_NE(data, nullptr);226 }227228 Local<Object> buf;229 if (New(isolate, data, actual).ToLocal(&buf))230 return scope.Escape(buf);231232 // Object failed to be created. Clean up resources.233 free(data);234 return Local<Object>();235 }

Somehow, datawas NULL. Thinking the realloc()on line 224 might have failed, I double-checked memory usage on the servers. They were not in danger of reaching any limits, and crashes didn’t seem to depend on memory usage. The service crashed while using 1GB of RAM just as often as it did using 100MB. Dang. Not an easy fix. It was a slim hope anyways. Modern OSes don’t return NULLfrom malloc()and friends. [1]

The next thing I did was man realloc, to try and figure out how it could return NULL. Except for an out-of-memory condition, it wasn’t possible. Even passing NULLto realloc()returned a usable chunk of memory:

If ptr is NULL, realloc() is identical to a call to malloc() for size bytes. If size is zero and ptr is not NULL, a new, minimum sized object is allocated and the original object is freed.

There was simply no way for that function, executed sequentially, to fail in this way. Therefore (I reasoned), the bug must be another thread modifying shared state. Probably a hard-to-reproduce race condition. Ugh. Still, I needed to fix the issue. Desiring to know more, I tried to build a reproducible test case. But I could only trigger the crash in production and staging, and only when copying lots of data between instances of our service. Having a lot of other stuff to do, I mitigated the issue by reducing the peak rate at which the service copied data.

A month later, Matt finally got tired of seeing crash emails. We paired to try and find the underlying cause. When I was explaining the issue to Matt, I pointed out that realloc()never returns NULL. He double-checked the docs and disagreed. When he linked to the manpage describing realloc()’s behavior, it said:

If size was equal to 0, either NULL or a pointer suitable to be passed to free() is returned.

Wait, what?! It turned out that I had run man reallocon my mac, while he had googled “realloc” and clicked on the first result. That result described realloc()on Linux. The two behaved differently, and that difference was crucial. Again, here are the relevant lines from Buffer::New:

220 size_t actual = StringBytes::Write(isolate, data, length, string, enc);221 CHECK(actual <= length);222223 if (actual < length) {224 data = static_cast<char*>(realloc(data, actual));225 CHECK_NE(data, nullptr);226 }

If realloc()returns null, actualmust be 0. If actualis 0, StringBytes::Write()must have returned 0. But to get to line 220, lengthmust be greater than 0. How could this happen? Well, lengthis only set on line 214:

214 size_t length = StringBytes::Size(isolate, string, enc);

So StringBytes::Size()thinks the buffer is a certain size, but StringBytes::Write()disagrees or fails in some way. To confirm this hypothesis, Matt and I added a printf()before the realloc():

220 size_t actual = StringBytes::Write(isolate, data, length, string, enc);221 CHECK(actual <= length);222223 if (actual < length) {224 printf("actual: %u length: %u/n", actual, length);225 data = static_cast<char*>(realloc(data, actual));226 CHECK_NE(data, nullptr);227 }

I deployed this custom Node.js build to staging, and soon saw crashes immediately preceded by lines such as:

2015-10-17_19:22:45.89086 actual: 0 length: 11518

Yahtzee! We were on the right track. Now how could StringBytes::Write()return 0? We both suspected base64-encoded buffers. Delving into string_bytes.cc , we saw that StringBytes::Size()called base64_decoded_size(), which called base64_decoded_size_fast(), which basically returned length / 4 * 3. [2] At no point did any of these methods check for valid base64 encoded data. They didn’t strip whitespace or invalid characters. They just multiplied by 0.75.

It was a different story for StringBytes::Write(). That function called base64_decode(), which called base64_decode_fast(), which could return early with invalid base64 data. In that case, base64_decode_slow()is called. Let’s take a look at that function, which starts at line 167 of string_bytes.cc :

167 template <typename TypeName>168 size_t base64_decode_slow(char dst, size_t dstlen,169 const TypeName src, size_t srclen) {170 uint8_t hi;171 uint8_t lo;172 size_t i = 0;173 size_t k = 0;174 for (;;) {175 #define V(expr) /176 while (i < srclen) { /177 const uint8_t c = src[i]; /178 lo = unbase64(c); /179 i += 1; /180 if (lo < 64) /181 break; /* Legal character. */ /182 if (c == '=') /183 return k; /184 } /185 expr; /186 if (i >= srclen) /187 return k; /188 if (k >= dstlen) /189 return k; /190 hi = lo;191 V(/ Nothing. /);192 V(dst[k++] = ((hi & 0x3F) << 2) | ((lo & 0x30) >> 4));193 V(dst[k++] = ((hi & 0x0F) << 4) | ((lo & 0x3C) >> 2));194 V(dst[k++] = ((hi & 0x03) << 6) | ((lo & 0x3F) >> 0));195 #undef V196 }197 UNREACHABLE();198 }

This macro-fied code may be a little hard to follow, but the behavior we care about is straightforward. Look at lines 182 and 183. Any =in the data causes the function to return early. It doesn’t matter if srcis a megabyte. If the first character is =, kis still zero when line 183 is hit. Once we figured that out, it wasn’t too hard to reproduce the issue in a line of JavaScript. Try this with Node.js (or io.js) from v3.0.0 to v4.2.1:

[email protected]:~% node> new Buffer("=" + new Array(10000).join("A"), "base64");node: ../src/node_buffer.cc:225: v8::MaybeLocal<v8::Object> node::Buffer::New(v8::Isolate*, v8::Local<v8::String>, node::encoding): Assertion `(data) != (nullptr)' failed.zsh: abort (core dumped) [email protected]:~%

Armed with a one-liner crash, I reported the issue to Node.jsand described how I thought it was breaking. It only took a day for Ben Noordhuisto fix the bug in master. Node.js v4.2.2 and v5.0.0 will have the fix. Mission accomplished!

Except, we forgot one thing. Where was this invalid base64 coming from? Why was our back-end service processing it? This post is pretty long, so that will have to wait.

Thanks to Matt Kaniaris, Ben Noordhuis, and the rest of the Node.js team for their help.

Even when a process asks for more memory than is available, modern OSes return a usable pointer. Only when the memory is accessed will the OS jump into action and free memory by killing processes.

Why length / 4 * 3instead of length * 0.75or (length * 3) / 4? This is C++, so the former requires a type conversion to float or double, followed by rounding. The latter could overflow if lengthis greater than SIZE_MAX / 3.