Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setInterval callback function unexpected halt #22149

Closed
newbreach opened this issue Aug 6, 2018 · 57 comments
Closed

setInterval callback function unexpected halt #22149

newbreach opened this issue Aug 6, 2018 · 57 comments
Labels
timers Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout.

Comments

@newbreach
Copy link

There is a setInterval method in the express project. When the project has been running for a month or so, there is no error, the project does not stop, but the callback of the setInterval is no longer executed.

like this :
setInterval(function(){
//do samething
console.log(Date.now());
},500);
A month later, no output is available, but the application process is robust.

@ChALkeR ChALkeR added the timers Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout. label Aug 6, 2018
@tkarls
Copy link

tkarls commented Aug 6, 2018

This is interesting. I have an app with some code looking like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed.

However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something.

My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded.

A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before....

@targos
Copy link
Member

targos commented Aug 6, 2018

@nodejs/timers

@benjamingr
Copy link
Member

For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.

@tkarls

The app still responds to api requests and new timers that are set during for example an api call is executed.

It is very likely that promiseReturningFunction never fulfilled or that you are returning a promise that was pending forever from either the catch or the then below it - or the then below is never executed.

Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for unhandledRejections?

@tkarls
Copy link

tkarls commented Aug 6, 2018

@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code.

The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Not familiar with the async_hooks and logs are not saved (which is a flaw...) :(

@benjamingr
Copy link
Member

benjamingr commented Aug 6, 2018

@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious".

If there is a way to get the "unhandledRejection" warnings then there is a very high chance you'll see it there. I've seen this before several times:

  • server runs for days.
  • there is a networking "hickup" and stuff fails for just a second or two.
  • recurring tasks fail, but since Node doesn't exit on promise unhandled rejection warnings the server chugs on.
  • Many people don't log warnings in production.

I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in setInterval - so take everything I say (guess) with a grain of salt.

Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job.

Typically, not on the server and using something that does scheduling - either what your cloud provides, crons if you're on a VM, k8s/cron jobs etc. I also warmly recommend turning on warnings (in production) for safety and using an APM (application performance monitoring) tool and setting up alerts for uptime and certain tasks.

Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it.

@newbreach
Copy link
Author

As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds.

@benjamingr
Copy link
Member

@newbreach in a similar situation - what sort of code do you have in the setTimeout?

@tkarls
Copy link

tkarls commented Aug 6, 2018

@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D

Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors.

But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too).

One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year.

The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too...

I do appreciate you insights and guesses though. That is all I'm doing right now too :)

@newbreach
Copy link
Author

newbreach commented Aug 6, 2018

@benjamingr

My code logic is probably like this

  1. accept the HTTP request and send the message to the TCP client.

  2. the request callback method is added to the queue.

  3. receive the message of TCP client. If a reply is received, the callback method is removed from the queue and the callback is executed. The content of the callback method is "response.write" ({"success": true}).

  4. timer checks whether the request message is more than 40 seconds. If it is over, remove the callback method and execute it.

The part of the code is as follows:

setInterval(function () {
  let cbs = [];
  for (let key of callbackQueue.keys()) {
    let item = callbackQueue.get(key)
    if (item && Date.now() - item.time >= 40000) {
      callbackQueue.delete(key);
      cbs.push(item);
    }
  }
  for (let i in cbs) {
    cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
  }
}, 200);

thanks!

Edit by @Fishrock123 - code formatting

@newbreach
Copy link
Author

Click the mistake, I'm sorry

@newbreach newbreach reopened this Aug 6, 2018
@benjamingr
Copy link
Member

benjamingr commented Aug 6, 2018

I'm trying to reproduce this (with NODE_DEBUG=*, and the code setInterval(() => console.log('ping', Date.now()), 500);) - I'll see if I can hit it after a day or two. Otherwise I'm not sure how I can do "time dilation" and run it faster in a VM - I'll look into that.

Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212

@Fishrock123
Copy link
Member

This just became a problem recently? Like, in Node 10 ... or something else?

@apapirovski
Copy link
Member

I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce.

@tkarls
Copy link

tkarls commented Aug 8, 2018

@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8
The next version I have used after that is 10.4.1 and there I got this problem after a few weeks (3-4) of run-time on several servers.

Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run.

I will add any info I get from the debugging here

@benjamingr
Copy link
Member

Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in.

@tkarls
Copy link

tkarls commented Aug 8, 2018

Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange.

I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value.

Doing the same thing on a setTimeout that is not working results is:
before setTimeout I have 23 active handles.
after setTimeout I have 23 active handles and the last one in the list is NOT my timer. Probably why it is never called.

I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value.

sounds pretty strange to me...

@Fishrock123
Copy link
Member

Fishrock123 commented Aug 8, 2018

That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule)

@apapirovski You're probably most familiar with this code now, any ideas?

@apapirovski
Copy link
Member

Nothing off the top of my head. Will investigate.

@apapirovski
Copy link
Member

@tkarls can you elaborate on "background timers"? Are these using unref or something? Any chance either me or @Fishrock123 could get access to the codebase? It would make it easier to know what I should be looking for in our code.

@tkarls
Copy link

tkarls commented Aug 9, 2018

@apapirovski
With background timer I mean that It's basically a function that does some task and then reschedule itself using setTimeout(). Not using unref(), only setTimeout and sometimes clearTimeout

Like this:

function doSomething() {
    promiseReturningFunction().then((data) => {
        //do some
    }).catch((error) => {
        //log error
    }).then(() => {
        setTimeout(doSomething, 5000);
    });
}

I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling.

The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows;

enableFirewall: (options) => {
        return getFwStatus().then((before) => {
            return execp('sudo ufw --force enable').then(()=>{
                setDisableTimer(disableTime);
            }).then(()=>{
                return getFwStatus().then((after) => {
                    options.before = before;
                    options.after = after;
                    return addAuditLogEntry(options);
                }); 
            });
        }).then(()=>{
            return disableTime;
        }).catch( (error) => {
            options.name = 'firewall-error';
            return disableFirewall(options).catch(() => {
                return execp('sudo ufw disable')
            }).then(() => {
                return Promise.reject(error);
            });
        });
    }

function setDisableTimer(timeout) {
    cancelDisableTimer();
    disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);
}

function cancelDisableTimer() {
    if (disableFirewallTimer != null) {
        clearTimeout(disableFirewallTimer);
        disableFirewallTimer = null;
    }
}

function disableFirewall(options) {
    var options = options || { user: '--unknown user--' }
    return getFwStatus().then((before) => {
        return execp('sudo ufw disable').then(() => {
            return getFwStatus().then((after) => {
                options.before = before;
                options.after = after;
                return addAuditLogEntry(options);
            });
        });
    });
}

The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error

disableFirewallTimer = setTimeout(() => {
        disableFirewall({ user: '--disable timer--'});
    }, timeout);

and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either).

One of the "background tasks" that are not working any more

function keepNtpStatusFresh() {
    getNtpStatusOutputInternal().then((data) => {
        lastNtpCheck.data = data;
        lastNtpCheck.when = new Date();
    }).catch((error) => {
        lastNtpCheck.data = null;
        sails.log.error(error);
    }).then(() => {
        setTimeout(keepNtpStatusFresh, 5000);
    });
}

function getNtpStatusOutputInternal() {
    return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
        var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
        var match = re.exec(result.stdout);

        var servers = [];
        while (match != null) {
            servers.push({
                type: match[1],
                remote: match[2],
                refid: match[3],
                st: Number(match[4]),
                t: match[5],
                when: Number(match[6]),
                poll: Number(match[7]),
                reach: Number(match[8]),
                delay: Number(match[9]),
                offset: Number(match[10]),
                jitter: Number(match[11]),
            });
            match = re.exec(result.stdout);
        }
        return { output: result.stdout, serverList: servers };
    });
} 

@benjamingr
Copy link
Member

ping 1533808667121
TIMER 61156: 500 list wait because diff is -1
TIMER 61156: timeout callback 500
TIMER 61156: now: 116841691

My test based on the above is still running for several days now by the way.

@tkarls
Copy link

tkarls commented Aug 9, 2018

Yes, also since our servers were restarted they are working fine for little over 10 days now...

I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time.

If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments.

My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :)

@benjamingr
Copy link
Member

@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking.

Maybe @cjihrig or @saghul will know? (since libuv timers)

(We do cap the timeout itself to 32bit integers but not the number of repeats).

I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help.

@tkarls
Copy link

tkarls commented Aug 9, 2018

@apapirovski @benjamingr @Fishrock123

I was just able to reproduce this with a smaller example 😁

I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days!

Here is the script for reference:

function timer1 () {
    console.log('timer1' + new Date().toString());
    setTimeout(timer1, 500);
}

function timer2 () {
    console.log('timer2' + new Date().toString());
    setTimeout(timer2, 500);
}

function timer3 () {
    console.log('timer3' + new Date().toString());
    setTimeout(timer3, 500);
}

function timer4 () {
    console.log('timer4' + new Date().toString());
    setTimeout(timer4, 500);
}

function timer5 () {
    console.log('timer5' + new Date().toString());
    setTimeout(timer5, 1500);
}

function timer6 () {
    console.log('timer6' + new Date().toString());
    setTimeout(timer6, 500);
}

function timer7 () {
    console.log('timer7' + new Date().toString());
    setTimeout(timer7, 500);
}

function timer8 () {
    console.log('timer8' + new Date().toString());
    setTimeout(timer8, 500);
}

function timer9 () {
    console.log('timer9' + new Date().toString());
    setTimeout(timer9, 500);
}
function timer10 () {
    console.log('timer10' + new Date().toString());
    setTimeout(timer10, 500);
}

timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10();

In the middle of september 3 the output stopped. No error other output in the console.

Here is how I started it:
faketime -f "+10 x1000" node timer_test

@benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead...

@apapirovski
Copy link
Member

I think I know the bug... confirming right now. Ugh.

@benjamingr
Copy link
Member

@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :)

@toymachiner62
Copy link

Do we know what versions of Node this is affecting? We're currently on 9.x and was going to start implementing some polling using setTimeout/setInterval and i'm worried we might hit this issue.

Also do we know if it's only when using setInterval for a long period of time? In my case we will use setInterval along with clearInterval and i'm curious if that prevents this issue from occurring.

@peernohell
Copy link

Hi,
We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ?
Thanks ❤️

@newbreach
Copy link
Author

Hi,
We have some similar issue but with node 8. Does this version is impacted by this bug and will it get a fix in a minor version ?
Thanks ❤️

升级到10.9.0以上吧,就是18年8月15日以后的版本都可以

@Fishrock123
Copy link
Member

This is fixed in Node 10.9.0

@toymachiner62
Copy link

This is fixed in Node 10.9.0

@peernohell I guess we're not getting the answer to our questions 😆

@nhathongly
Copy link

I’ve got the same issue with node.js 12.x. Either setInterval or setTimeout unexpectedly stop without any error after some few weeks.

@adams-family
Copy link

@tkarls Excellent job isolating the problem!
@benjamingr Thanks for the fix!

We know that Node 10.9 fixes the issue, what about Node 12 and Node 14, can you please comment on those?

@paynebc
Copy link

paynebc commented Aug 20, 2020

I am having the same issue with timeouts in Node 12.18.3
After a couple of days, it just stops working without any error. Doesn't seem to be 25 days though

@adams-family
Copy link

@paynebc Any chance that you could reproduce this with libfaketime as tkarls did above?

#22149 (comment)

@paynebc
Copy link

paynebc commented Aug 20, 2020

I will give this a try

@tkarls
Copy link

tkarls commented Aug 20, 2020

FYI:

I ran my script above with libfaketime again and it does not detect any problem on node 12.18.3
Still running and scheduling stuff for "3 years" now.

@adams-family
Copy link

Interesting, it must be a different problem then. I'm thinking about renting a VPS and running like 5-10 major versions of NodeJS in parallel with libfaketime. Would that bring a value to anyone?

@paynebc
Copy link

paynebc commented Aug 20, 2020

I tried the script above now, too. It does indeed run totally fine, so whatever it is I have must be of different origin. But it does show exactly the same symptoms: every couple days, my main loop just stops working. I am using the following function and calling it with await in my main loop instead of passing my mainloop function directly to setTimeout

exports.sleep = function(millis) {
return new Promise(resolve => setTimeout(resolve, millis));
};

I am now running my application using NODE_DEBUG=* in the hopes of finding anything useful.

@adams-family
Copy link

It's interesting because you are using setTimeout as opposed to setInterval in this topic. Do I understand correctly that this is what your code looks like?

while (true) {
  await sleep(500);
  // …
}

Are you absolutely sure that it is the sleep function where your code hangs? (Such as by putting a console.log before and after the sleep function?) Perhaps could it be something else in your loop?

@paynebc
Copy link

paynebc commented Aug 20, 2020

yes this is how it works. I now added some more logging to find out if this is indeed where it hangs. I will report back once I have an idea what it might be

@adams-family
Copy link

Ok, sounds good!

FYI actually the reason why I used setInterval instead of while & await & setTimeout was because I thought that setInterval will be rock solid reliable as opposed to an async while loop where any uncaught exception/rejection can break the flow. In the end it turned our that setInterval wasn't rock solid either, nw, just fyi.

@paynebc
Copy link

paynebc commented Aug 20, 2020

yes I originally wanted to do it that way, too. but I was not sure what happens, when one loop iteration takes longer than my interval time. Actually I guess I do know what happens and I don't think it would be a good idea :) This way I do not have a constant wait time because the interval will be whatever the iteration uses +500 ms, but that is fine, I don't care about exact times, I just don't want the loop to run constantly

@adams-family
Copy link

I understand, it makes sense!

I think what would happen depends whether your loop is synchronous or asynchronous?

  • Async: setInterval ticks get get scheduled in a deferred event loop just as any other async operations -> individual operations will mix up and you will have multiple instances of your loop running seemingly in parallel
  • Sync: setInterval ticks get still scheduled in a deferred even loop but because the thread is busy due to a blocking sync operation, they will get scheduled at the end of the loop -> your code will still be synchronous but your event loop will grow indefinitely and you will run out of resources sooner or later IMO

Therefore yes, I think your approach is better for what you want to achieve :)

@paynebc
Copy link

paynebc commented Aug 20, 2020

the work that is happening in the loop is asynchronous. so yes, I will get multiple instances running in parallel and I would very much want to avoid that

@ThabetSabha
Copy link

@paynebc

I am having the same issue with timeouts in Node 12.18.3 After a couple of days, it just stops working without any error. Doesn't seem to be 25 days though

Hi, I am currently facing the same issue with v12.18.1, did you manage to find a way to replicate the issue? I am gonna switch to using node-cron, but I wanna make sure it's setInterval that's causing the issue, and not some other part of the process.

I opened an issue here: #41037

@nhathongly
Copy link

@ThabetSabha as what I know the issue is still there in node 14, and we’ve used node-cron to replace most long-running setInterval in our projects for more than one year, and it’s been working well until now

@ThabetSabha
Copy link

@nhathongly great, that's what I will be doing, did you manage to figure out a way to replicate the issue though?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
timers Issues and PRs related to the timers subsystem / setImmediate, setInterval, setTimeout.
Projects
None yet
Development

Successfully merging a pull request may close this issue.