Alpine + Kubernetes + DNS


#alpine#kubernetes#dns

Weary travelers breaking into the Kubernetes world – this post is for you.

During my journey into K8s, I’ve tried to use a lot of Alpine-based images. In many cases they are considerably more lightweight than the alternatives, which makes them especially attractive. But there are some issues, and some workarounds.

From my reading, it seemed that all the DNS strangeness should have been fixed around Alpine v3.5, but I’ve recently been encountering more strange behaviour. Every image that is not based on Alpine has had working DNS name resolution, both internal and external to the cluster (i.e., it can resolve both service.default.svc.cluster.local and google.com). On the flip side, none of the Alpine images I have tried could resolve anything other than internal cluster DNS. And, of course, setting dnsPolicy to Default in a PodSpec causes the exact inverse issue (only external DNS can be resolved), which is expected.

To clarify, in a clean Alpine 3.6 image (built from here) running on a Kubernetes 1.7.4 cluster, this is what is happening:

/ # nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

nslookup: can't resolve 'google.com': Name does not resolve

And this is the fairly normal resolv.conf:

nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local kubec.maio.me
options ndots:5

Earlier today, I read something about the search directive having some funky behaviour on Alpine. Could that be related?

So, I modified the resolv.conf to remove the unresolvable search domain that got pulled in from the nodes’ resolv.conf:

nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

Et voilà.

/ # nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve

Name:      google.com
Address 1: 173.194.67.102 od-in-f102.1e100.net
Address 2: 173.194.67.101 od-in-f101.1e100.net
Address 3: 173.194.67.139 od-in-f139.1e100.net
Address 4: 173.194.67.100 od-in-f100.1e100.net
Address 5: 173.194.67.138 od-in-f138.1e100.net
Address 6: 173.194.67.113 od-in-f113.1e100.net
Address 7: 2607:f8b0:4003:c17::71 od-in-x71.1e100.net

/ # nslookup pg-prod-postgresql.default
nslookup: can't resolve '(null)': Name does not resolve

Name:      pg-prod-postgresql.default
Address 1: 10.106.241.98 pg-prod-postgresql.default.svc.cluster.local

Suddenly, both internal and external name resolution work. It seems that the long-term solution is to remove that unresolvable search domain from the resolv.conf on all the nodes. This discovery raises a question, though: why is the musl DNS resolver choking on that last “unresolvable” search domain?
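For context, here is a minimal Python sketch of how a search-list resolver expands a short name into the candidates it will try. It assumes the common rule that a name with fewer dots than ndots is tried against each search domain first and as-is last; the function name `candidate_names` is my own, and what a given resolver does for names with enough dots is not modelled here.

```python
def candidate_names(name, search, ndots):
    """Return the names a search-list resolver would try, in order.

    Simplified model: assumes search domains are tried first and the
    bare name last whenever the name has fewer than `ndots` dots.
    """
    # A trailing dot marks the name as fully qualified: no search list.
    if name.endswith("."):
        return [name.rstrip(".")]
    # Assumption: names with enough dots are simply tried as-is.
    if name.count(".") >= ndots:
        return [name]
    return [f"{name}.{domain}" for domain in search] + [name]


search = [
    "staging.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "kubec.maio.me",
]
for candidate in candidate_names("google.com", search, ndots=5):
    print(candidate)
```

With ndots:5, google.com has only one dot, so the bare name is the very last thing tried, right after google.com.kubec.maio.me.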


So I kept digging.

I tried adding another known, unrelated, unresolvable search domain to the resolv.conf:

nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local nope.thisdoesnotresolve.info
options ndots:5

… But both resolutions still work! Why? Maybe strace can give us some clues?

... snip ...

## trying to resolve google.com.nope.thisdoesnotresolve.info
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\310{\1\0\0\1\0\0\0\0\0\0\6google\3com\4nope\22thi"..., 57, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 57
sendto(3, "\311\253\1\0\0\1\0\0\0\0\0\0\6google\3com\4nope\22thi"..., 57, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 57
poll([{fd=3, events=POLLIN}], 1, 2500)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\311\253\201\203\0\1\0\0\0\1\0\0\6google\3com\4nope\22thi"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 117
recvfrom(3, 0x7ffd84553d10, 512, 0, 0x7ffd845539c0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2439)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\310{\201\203\0\1\0\0\0\1\0\0\6google\3com\4nope\22thi"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 117
close(3)                                = 0

## trying to resolve google.com!
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\3061\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 28
sendto(3, "\307\22\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\34\0\1", 28, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 28
poll([{fd=3, events=POLLIN}], 1, 2500)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\3061\201\200\0\1\0\6\0\0\0\0\6google\3com\0\0\1\0\1\300\f\0\1"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 124
recvfrom(3, 0x7ffd84553f10, 512, 0, 0x7ffd845539c0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2444)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\307\22\201\200\0\1\0\1\0\0\0\0\6google\3com\0\0\34\0\1\300\f\0\34"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 56
close(3)                                = 0

... snip ...

So, what happens when we’re using my “unresolvable” search domain?

... snip ...

## trying to resolve google.com.kubec.maio.me
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\337<\1\0\0\1\0\0\0\0\0\0\6google\3com\5kubec\4ma"..., 42, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 42
sendto(3, "\3402\1\0\0\1\0\0\0\0\0\0\6google\3com\5kubec\4ma"..., 42, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 42
poll([{fd=3, events=POLLIN}], 1, 2500)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\337<\201\200\0\1\0\0\0\1\0\0\6google\3com\5kubec\4ma"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 104
recvfrom(3, 0x7ffe5bf37740, 512, 0, 0x7ffe5bf371f0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2433)  = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\3402\201\200\0\1\0\0\0\1\0\0\6google\3com\5kubec\4ma"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 104
close(3)                                = 0
write(2, "nslookup: can't resolve 'google."..., 60nslookup: can't resolve 'google.com': Name does not resolve
) = 60
exit_group(1)                           = ?
+++ exited with 1 +++

It doesn’t even go on to retry without a search domain… So let’s do an additional test outside of the cluster:

λ host -v google.com.nope.thisdoesnotresolve.info 8.8.8.8
Trying "google.com.nope.thisdoesnotresolve.info"
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:

Host google.com.nope.thisdoesnotresolve.info not found: 3(NXDOMAIN)
Received 117 bytes from 8.8.8.8#53 in 41 ms
Received 117 bytes from 8.8.8.8#53 in 41 ms

λ host -v google.com.kubec.maio.me 8.8.8.8
Trying "google.com.kubec.maio.me"
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:

;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48848
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.kubec.maio.me.      IN      A

;; AUTHORITY SECTION:
maio.me.                1799    IN      SOA     elle.ns.cloudflare.com. dns.cloudflare.com. 2025556331 10000 2400 604800 3600

Received 104 bytes from 8.8.8.8#53 in 67 ms
Trying "google.com.kubec.maio.me"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17593
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.kubec.maio.me.      IN      AAAA

;; AUTHORITY SECTION:
maio.me.                1799    IN      SOA     elle.ns.cloudflare.com. dns.cloudflare.com. 2025556331 10000 2400 604800 3600

Received 104 bytes from 8.8.8.8#53 in 40 ms
Trying "google.com.kubec.maio.me"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50055
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;google.com.kubec.maio.me.      IN      MX

;; AUTHORITY SECTION:
maio.me.                1145    IN      SOA     elle.ns.cloudflare.com. dns.cloudflare.com. 2025555945 10000 2400 604800 3600

Received 104 bytes from 8.8.8.8#53 in 20 ms
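The same distinction is visible in the raw bytes strace captured inside the cluster. As a rough sketch, the snippet below decodes the fixed 12-byte DNS header (RFC 1035, section 4.1.1) from the start of two of the recvfrom() buffers in the traces above. The helper name `decode_header` is hypothetical, and only the header is decoded because strace truncated the rest of each packet.

```python
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}


def decode_header(packet):
    """Decode the fixed 12-byte DNS header of a message."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!6H", packet[:12]
    )
    rcode = flags & 0x000F  # response code lives in the low 4 bits
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),  # QR bit
        "rcode": RCODES.get(rcode, str(rcode)),
        "answers": ancount,
        "authority": nscount,
    }


# Header bytes copied from the strace output (octal escapes kept as printed).
nope_reply = b"\311\253\201\203\0\1\0\0\0\1\0\0"  # ...nope.thisdoesnotresolve.info
kubec_reply = b"\337<\201\200\0\1\0\0\0\1\0\0"    # ...kubec.maio.me

for label, reply in (("nope", nope_reply), ("kubec", kubec_reply)):
    print(label, decode_header(reply))
```

Both replies carry zero answers and one authority record, but the first has rcode NXDOMAIN while the second has rcode NOERROR.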

So:

The problem seems to trace all the way up to Cloudflare, and I know for certain that I don’t have a record for *.kubec.maio.me… That leaves only one possible cause I have yet to rule out: Cloudflare’s CNAME flattening. I have the zone apex set as a CNAME of kube-n01.kubec.maio.me. That doesn’t explain why Cloudflare returns an empty record set instead of an NXDOMAIN (or even a CNAME to kube-n01.kubec.maio.me), but it is the only reasonable conclusion I can come to.
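To make the failure mode concrete, here is a simplified Python model that is consistent with the traces: the search walk moves on to the next candidate only after an NXDOMAIN, and gives up on any other reply that carries no records. It is a sketch of the observed behaviour, not musl’s actual source; `search_lookup` and `fake_query` are hypothetical names, and `fake_query` just replays the answers seen in this post (assuming the cluster-local candidates return NXDOMAIN).

```python
NOERROR, NXDOMAIN = 0, 3


def search_lookup(name, search, ndots, query):
    """Walk the search list, stopping at the first non-NXDOMAIN reply.

    `query(fqdn)` is a hypothetical callback returning (rcode, addresses).
    """
    # Assumption: names with at least `ndots` dots skip the search list.
    use_search = name.count(".") < ndots
    candidates = [f"{name}.{d}" for d in search] if use_search else []
    candidates.append(name)
    for fqdn in candidates:
        rcode, addresses = query(fqdn)
        if addresses:
            return addresses
        if rcode == NXDOMAIN:
            continue  # name does not exist: try the next candidate
        # NOERROR with no records, or any other error, ends the walk.
        raise LookupError(f"{fqdn}: rcode {rcode} with no records; search stopped")
    raise LookupError(f"{name}: not found")


def fake_query(fqdn):
    """Replay the replies observed in this post."""
    if fqdn == "google.com":
        return NOERROR, ["173.194.67.102"]
    if fqdn.endswith(".kubec.maio.me"):
        return NOERROR, []  # empty answer instead of NXDOMAIN
    return NXDOMAIN, []


base = ["staging.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(search_lookup("google.com", base + ["nope.thisdoesnotresolve.info"], 5, fake_query))
try:
    search_lookup("google.com", base + ["kubec.maio.me"], 5, fake_query)
except LookupError as err:
    print("failed:", err)
```

With the NXDOMAIN-returning search domain the walk reaches the bare google.com and succeeds; with kubec.maio.me it stops one step short, which matches what nslookup did.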

With that, I leave you a haiku:

DNS Haiku


It appears that I am not the only one with this issue!

This is extremely frustrating. It seems that the only viable solutions are either a) to change node hostnames to something from which a search domain cannot be derived, or b) to switch DNS providers. :(