Subject: Two questions: 1. Callback function not called; 2. TCP connect to a nonexistent server prevents use of an existing server.

From: Anlin Zhang <anlin.zhang_at_gmail.com>
Date: Wed, 8 Aug 2007 18:10:56 -0700

Hi,

We are trying to use c-ares in a fairly mission-critical application, but
have run into some problems. I would like to know whether you have solutions.

Two problems:
Problem 1: Not all queries get a callback after calling ares_query() 200
times. UDP packets might be lost, but shouldn't there be a callback reporting
the timeout? Or do I need to set some channel options to get all callbacks?

Problem 2: If I force "Always use TCP" and the primary DNS host machine is
shut down, TCP setup to the primary DNS server takes more than a minute to
give up before c-ares tries the secondary DNS server. (Shut down here means
not just killing the named process but powering down the server; this makes
a big difference, since no TCP stack is running to reply to the TCP SYN.) I
did a quick fix for this problem, and it can now switch to the secondary DNS
server immediately. I also applied TCP keepalive to the socket to detect an
unplugged network cable. I would like you to review the change (I have only
considered compiling on Linux here; TCP keepalive might not be portable).
I'm still having "Problem 1" with TCP if I shut down and restart the primary
or secondary DNS server at random.

Problem 1 Details:
 With the default channel settings (ares_init) and two DNS servers in
/etc/resolv.conf,
 I call ares_query() 200 times for 200 host names (I see c-ares use UDP to
send the queries), then run:
 loop {
  ares_fds(),
  ares_timeout(),
  select(),
  ares_process(),
 }
 After looping many times, only some of the 200 host names get a callback;
the rest seem to be simply lost, without even a timeout callback.

Problem 2 Details:
 With optional channel settings (ares_init_options), I set opts.flags =
ARES_FLAG_USEVC | ARES_FLAG_NORECURSE | ARES_FLAG_STAYOPEN;
 I power down the primary DNS server (or use iptables to block TCP packets)
 and call ares_query() 200 times for 200 host names.
 c-ares tries a non-blocking connect to the primary DNS server. Since it gets
no ACK for its SYN, the TCP handshake timeout algorithm starts, which can
take more than 75 seconds to give up.
 Meanwhile the FD is considered valid, but if the application does a select()
on the FD, there is no event on it. ares_process() does not seem to try the
other servers either.

 I fixed problem 2 by checking whether the FD is really connected before
queuing the query on the connection. How do I detect whether the connection
is set up or not? I simply call the non-blocking connect() again. If the new
connect() returns the error EINPROGRESS, EAGAIN, or EALREADY, the connection
is still in progress and cannot be used to send queries. If connect()
returns no error, or returns the error EISCONN, the socket is connected and
ready to send queries. If it is not connected, I call next_server() to try
connecting to other servers. Here is the code I added in ares__send_query():

  ares_socket_t s;
  struct sockaddr_in sockin;

  printf("test tcp_socket...\n");

  /* Reuse the TCP socket that is already open for this server. */
  s = server->tcp_socket;
  /* Probe the connection state by calling connect() again. */
  memset(&sockin, 0, sizeof(sockin));
  sockin.sin_family = AF_INET;
  sockin.sin_addr = server->addr;
  sockin.sin_port = (unsigned short)(channel->tcp_port & 0xffff);
  if (connect(s, (struct sockaddr *) &sockin, sizeof(sockin)) == -1) {
   int err = SOCKERRNO;

   if (err == EINPROGRESS || err == EWOULDBLOCK || err == EALREADY) {
    /* Handshake not finished yet: skip this server. */
    printf("connect still EINPROGRESS, cannot use this one, skip to next\n");
    next_server(channel, query, now);
    return;
   }
   if (err == EISCONN) {
    printf("already connected, use this one, fd is %d\n",
           server->tcp_socket);
   }
   /* Any other error falls through and the socket is still used. */
  } else {
   printf("connect() return no error, use this one %d\n",
          server->tcp_socket);
  }
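As a point of comparison with the repeated connect() call above, a common
POSIX way to learn whether a non-blocking connect has finished is to check
the socket for writability with a zero timeout and then read SO_ERROR. The
sketch below is only an illustration under that assumption; start_connect()
and tcp_connect_state() are hypothetical names and are not part of c-ares.

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper (not c-ares): start a non-blocking connect to addr.
 * Returns the socket, or -1 if the attempt failed immediately. */
static int start_connect(const struct sockaddr_in *addr)
{
  int flags;
  int fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd < 0)
    return -1;
  flags = fcntl(fd, F_GETFL, 0);
  if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
    close(fd);
    return -1;
  }
  if (connect(fd, (const struct sockaddr *)addr, sizeof(*addr)) < 0 &&
      errno != EINPROGRESS) {
    close(fd);
    return -1;
  }
  return fd;
}

/* Hypothetical helper (not c-ares): report the state of a connect that was
 * started with start_connect(). Returns 1 if connected, 0 if the handshake
 * is still pending, -1 if the connect failed (errno holds the reason). */
static int tcp_connect_state(int fd)
{
  fd_set wfds;
  struct timeval tv = { 0, 0 };   /* zero timeout: poll, never block */
  int err = 0;
  socklen_t len = sizeof(err);
  int n;

  FD_ZERO(&wfds);
  FD_SET(fd, &wfds);
  n = select(fd + 1, NULL, &wfds, NULL, &tv);
  if (n < 0)
    return -1;
  if (n == 0)
    return 0;                     /* not writable yet: still connecting */

  /* Writable means the connect attempt is over; SO_ERROR says how. */
  if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
    return -1;
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 1;
}
```

A caller could skip to the next server while tcp_connect_state() returns 0
and close the socket when it returns -1, without issuing a second connect().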

 This fix seems to resolve the problem of not being able to use the secondary
DNS server while TCP setup to the primary DNS server is in progress. But what
happens if the primary DNS server is up and the connection was set up
successfully at the beginning, but the server later crashes or becomes
unreachable (for example, the network cable is unplugged)? In that case
c-ares is fooled into thinking the FD is still usable, and indeed write() to
it returns no error! My solution here is to apply TCP keepalive to the
connection. Here is the code I added in the open_tcp_socket() function:

  /* Set the socket non-blocking. */
  nonblock(s, TRUE);

  /* Enable TCP keepalive. The TCP_KEEP* options need <netinet/tcp.h> and
   * are not available on every platform. */
  int opt = 1;
  if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, (const void *)&opt,
                 sizeof(opt)) < 0) {
   printf("setsockopt keepalive error, errno: %d\n", errno);
  }
  int keepIdle = 2;   /* seconds of idle time before the first probe */
  if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, (const void *)&keepIdle,
                 sizeof(keepIdle)) < 0) {
   printf("setsockopt keepalive-keepIdle error, errno: %d\n", errno);
  }
  int keepCnt = 1;    /* unanswered probes before the connection is dropped */
  if (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, (const void *)&keepCnt,
                 sizeof(keepCnt)) < 0) {
   printf("setsockopt keepalive-keepCnt error, errno: %d\n", errno);
  }
  int keepIntvl = 1;  /* seconds between probes */
  if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, (const void *)&keepIntvl,
                 sizeof(keepIntvl)) < 0) {
   printf("setsockopt keepalive-keepIntvl error, errno: %d\n", errno);
  }
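The same keepalive setup could also be wrapped in one helper that reports
failure to the caller instead of only logging it. This is a sketch, not c-ares
code: enable_keepalive() and its parameter names are hypothetical, and the
TCP_KEEP* options are assumed to be the Linux ones from <netinet/tcp.h>. With
the values used above (idle 2, interval 1, count 1), Linux should declare a
silent peer dead roughly idle + interval x count = 3 seconds after the last
traffic.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper (not c-ares): enable TCP keepalive on fd.
 * idle_s:  seconds of idle time before the first probe
 * intvl_s: seconds between probes
 * count:   unanswered probes before the connection is dropped
 * Returns 0 on success, -1 on failure (errno set by setsockopt). */
static int enable_keepalive(int fd, int idle_s, int intvl_s, int count)
{
  int on = 1;

  if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
    return -1;

#if defined(TCP_KEEPIDLE) && defined(TCP_KEEPINTVL) && defined(TCP_KEEPCNT)
  /* Per-socket tuning; without it the system-wide defaults apply. */
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof(idle_s)) < 0)
    return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s,
                 sizeof(intvl_s)) < 0)
    return -1;
  if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count)) < 0)
    return -1;
#else
  /* Platform lacks these options: keepalive runs with system defaults. */
  (void)idle_s;
  (void)intvl_s;
  (void)count;
#endif
  return 0;
}
```

In open_tcp_socket() this would be called as enable_keepalive(s, 2, 1, 1)
right after the socket is made non-blocking.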

If the primary DNS server goes offline in the middle of a session, the
keepalive mechanism shuts down the socket, which makes the FD unwritable and
forces write() to return an error; ares_process() then detects it and
switches to another connection.
Now it seems to be working as expected. If someone can fix Problem 1, it
will be perfect.
I will attach the whole ares_process.c file that I modified for your review
(it is a little messy from the logging I inserted, and again, I have only
considered compiling on Linux so far). I still don't understand most of the
c-ares code, so hopefully I didn't break anything. Please review it; there
may be better ways to achieve this (for example, we could try other
connections once write() returns an error, or add another state to
tcp_socket).

Thanks,
Anlin.

Received on 2007-08-09