Alpine + Kubernetes + DNS
Weary travelers breaking into the Kubernetes world – this post is for you.
During my journey into K8s, I’ve tried to use a lot of Alpine-based images. In many cases, they are considerably more lightweight than the other available images, which makes them especially attractive. But there are some issues and some workarounds.
From my reading, it seemed that all the DNS strangeness should have been fixed around v3.5 of Alpine, but I’ve recently been encountering some more strange behaviour. Every image that is not based on Alpine has had working DNS name resolution, both internal and external to the cluster (i.e., it can resolve both service.default.svc.cluster.local and google.com). On the flip side, none of the Alpine images I have tried have been able to resolve anything other than internal cluster DNS. And, of course, setting dnsPolicy to Default in a PodSpec causes the exact inverse issue – only external DNS can be resolved – which is expected.
To clarify, in a clean Alpine 3.6 image (built from here) running on a Kubernetes 1.7.4 cluster, this is what is happening:
/ # nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'google.com': Name does not resolve
And this is the fairly normal resolv.conf:
nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local kubec.maio.me
options ndots:5
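To make the effect of that search line and ndots:5 concrete, here is a rough Python sketch of the order in which a stub resolver would try names for google.com. The function names (parse_resolv_conf, candidate_names) are made up for illustration, and the sketch only models the common case where a name has fewer dots than ndots; resolvers differ in how they treat names with more dots, so that branch simply returns the name as typed.

```python
RESOLV_CONF = """\
nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local kubec.maio.me
options ndots:5
"""


def parse_resolv_conf(text):
    """Pull the search list and ndots value out of resolv.conf text."""
    search, ndots = [], 1  # ndots defaults to 1 when no option sets it
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "search":
            search = fields[1:]
        elif fields[0] == "options":
            for opt in fields[1:]:
                if opt.startswith("ndots:"):
                    ndots = int(opt.split(":", 1)[1])
    return search, ndots


def candidate_names(name, search, ndots):
    """Names to try, in order (simplified; assumption, not any libc's code)."""
    if name.endswith(".") or name.count(".") >= ndots:
        # Fully qualified, or "dotted enough": use the name as typed.
        return [name]
    # Fewer dots than ndots: walk the search list first, then the bare name.
    return [f"{name}.{domain}" for domain in search] + [name]


search, ndots = parse_resolv_conf(RESOLV_CONF)
for candidate in candidate_names("google.com", search, ndots):
    print(candidate)
```

With the file above, google.com has only one dot, so all four search domains are tried before the bare name.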
Earlier today, I read something about the search directive having some funky behaviour on Alpine, which may potentially be related to this?
So, I modified the resolv.conf to remove the unresolvable search domain that got pulled in from the nodes’ resolv.conf:
nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Et voila.
/ # nslookup google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: google.com
Address 1: 173.194.67.102 od-in-f102.1e100.net
Address 2: 173.194.67.101 od-in-f101.1e100.net
Address 3: 173.194.67.139 od-in-f139.1e100.net
Address 4: 173.194.67.100 od-in-f100.1e100.net
Address 5: 173.194.67.138 od-in-f138.1e100.net
Address 6: 173.194.67.113 od-in-f113.1e100.net
Address 7: 2607:f8b0:4003:c17::71 od-in-x71.1e100.net
/ # nslookup pg-prod-postgresql.default
nslookup: can't resolve '(null)': Name does not resolve
Name: pg-prod-postgresql.default
Address 1: 10.106.241.98 pg-prod-postgresql.default.svc.cluster.local
Suddenly, both internal and external name resolution works. It seems that the long-term solution is to remove that unresolvable search domain from the resolv.conf on all the nodes. This discovery does raise a question, though: why is the musl DNS resolver choking on that last “unresolvable” search domain?
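The manual edit above is easy to express as a small text transformation. This is only a sketch: drop_search_domain is a hypothetical helper, and it rewrites resolv.conf text in memory; how the result would be rolled out to the nodes is a separate question it does not address.

```python
def drop_search_domain(resolv_conf, unwanted):
    """Return resolv.conf text with one domain removed from the search line."""
    out = []
    for line in resolv_conf.splitlines():
        fields = line.split()
        if fields and fields[0] == "search":
            kept = [domain for domain in fields[1:] if domain != unwanted]
            if kept:
                out.append("search " + " ".join(kept))
            # If no domains are left, the search line is dropped entirely.
            continue
        out.append(line)
    return "\n".join(out) + "\n"


original = (
    "nameserver 10.96.0.10\n"
    "search staging.svc.cluster.local svc.cluster.local cluster.local kubec.maio.me\n"
    "options ndots:5\n"
)
print(drop_search_domain(original, "kubec.maio.me"), end="")
```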
So I kept digging.
I tried adding another known, unrelated, unresolvable search domain to the resolv.conf:
nameserver 10.96.0.10
search staging.svc.cluster.local svc.cluster.local cluster.local nope.thisdoesnotresolve.info
options ndots:5
… But both resolutions still work! Why?
Maybe strace can give us some clues?
... snip ...
## trying to resolve google.com.nope.thisdoesnotresolve.info
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\310{\1\0\0\1\0\0\0\0\0\0\6google\3com\4nope\22thi"..., 57, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 57
sendto(3, "\311\253\1\0\0\1\0\0\0\0\0\0\6google\3com\4nope\22thi"..., 57, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 57
poll([{fd=3, events=POLLIN}], 1, 2500) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\311\253\201\203\0\1\0\0\0\1\0\0\6google\3com\4nope\22thi"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 117
recvfrom(3, 0x7ffd84553d10, 512, 0, 0x7ffd845539c0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2439) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\310{\201\203\0\1\0\0\0\1\0\0\6google\3com\4nope\22thi"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 117
close(3) = 0
## trying to resolve google.com!
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\3061\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\1\0\1", 28, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 28
sendto(3, "\307\22\1\0\0\1\0\0\0\0\0\0\6google\3com\0\0\34\0\1", 28, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 28
poll([{fd=3, events=POLLIN}], 1, 2500) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\3061\201\200\0\1\0\6\0\0\0\0\6google\3com\0\0\1\0\1\300\f\0\1"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 124
recvfrom(3, 0x7ffd84553f10, 512, 0, 0x7ffd845539c0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2444) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\307\22\201\200\0\1\0\1\0\0\0\0\6google\3com\0\0\34\0\1\300\f\0\34"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 56
close(3) = 0
... snip ...
So, what happens when we’re using my “unresolvable” search domain?
... snip ...
## trying to resolve google.com.kubec.maio.me
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
sendto(3, "\337<\1\0\0\1\0\0\0\0\0\0\6google\3com\5kubec\4ma"..., 42, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 42
sendto(3, "\3402\1\0\0\1\0\0\0\0\0\0\6google\3com\5kubec\4ma"..., 42, MSG_NOSIGNAL, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 42
poll([{fd=3, events=POLLIN}], 1, 2500) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\337<\201\200\0\1\0\0\0\1\0\0\6google\3com\5kubec\4ma"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 104
recvfrom(3, 0x7ffe5bf37740, 512, 0, 0x7ffe5bf371f0, [16]) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=3, events=POLLIN}], 1, 2433) = 1 ([{fd=3, revents=POLLIN}])
recvfrom(3, "\3402\201\200\0\1\0\0\0\1\0\0\6google\3com\5kubec\4ma"..., 512, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, [16]) = 104
close(3) = 0
write(2, "nslookup: can't resolve 'google."..., 60nslookup: can't resolve 'google.com': Name does not resolve
) = 60
exit_group(1) = ?
+++ exited with 1 +++
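For anyone squinting at the octal escapes in those recvfrom lines: the first 12 bytes of a DNS reply are the header, and the low four bits of the flags word carry the response code. Here is a minimal Python sketch that decodes the header bytes copied from the traces above (decode_header and the variable names are mine, purely for illustration; only the rcodes relevant here are named).

```python
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}

# First 12 bytes of one reply from each lookup, as printed by strace.
NOPE_REPLY = b"\311\253\201\203\0\1\0\0\0\1\0\0"   # google.com.nope.thisdoesnotresolve.info
GOOGLE_REPLY = b"\3061\201\200\0\1\0\6\0\0\0\0"    # google.com
KUBEC_REPLY = b"\337<\201\200\0\1\0\0\0\1\0\0"     # google.com.kubec.maio.me


def decode_header(packet):
    """Decode the fixed 12-byte DNS header (six big-endian 16-bit fields)."""
    ident, flags, qdcount, ancount, nscount, arcount = struct.unpack("!6H", packet[:12])
    rcode = flags & 0x000F
    return {
        "id": ident,
        "rcode": RCODES.get(rcode, str(rcode)),
        "answers": ancount,
        "authority": nscount,
    }


for label, reply in [("nope", NOPE_REPLY), ("google.com", GOOGLE_REPLY), ("kubec", KUBEC_REPLY)]:
    header = decode_header(reply)
    print(f"{label}: rcode={header['rcode']} answers={header['answers']}")
# nope: rcode=NXDOMAIN answers=0
# google.com: rcode=NOERROR answers=6
# kubec: rcode=NOERROR answers=0
```

So the reply for the made-up domain is an NXDOMAIN, while the reply for the kubec.maio.me name is a NOERROR that carries no answers.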
It doesn’t even go on to retry without a search domain… So let’s do an additional test outside of the cluster:
λ host -v google.com.nope.thisdoesnotresolve.info 8.8.8.8
Trying "google.com.nope.thisdoesnotresolve.info"
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:
Host google.com.nope.thisdoesnotresolve.info not found: 3(NXDOMAIN)
Received 117 bytes from 8.8.8.8#53 in 41 ms
Received 117 bytes from 8.8.8.8#53 in 41 ms
λ host -v google.com.kubec.maio.me 8.8.8.8
Trying "google.com.kubec.maio.me"
Using domain server:
Name: 8.8.8.8
Address: 8.8.8.8#53
Aliases:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48848
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com.kubec.maio.me. IN A
;; AUTHORITY SECTION:
maio.me. 1799 IN SOA elle.ns.cloudflare.com. dns.cloudflare.com. 2025556331 10000 2400 604800 3600
Received 104 bytes from 8.8.8.8#53 in 67 ms
Trying "google.com.kubec.maio.me"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 17593
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com.kubec.maio.me. IN AAAA
;; AUTHORITY SECTION:
maio.me. 1799 IN SOA elle.ns.cloudflare.com. dns.cloudflare.com. 2025556331 10000 2400 604800 3600
Received 104 bytes from 8.8.8.8#53 in 40 ms
Trying "google.com.kubec.maio.me"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50055
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; QUESTION SECTION:
;google.com.kubec.maio.me. IN MX
;; AUTHORITY SECTION:
maio.me. 1145 IN SOA elle.ns.cloudflare.com. dns.cloudflare.com. 2025555945 10000 2400 604800 3600
Received 104 bytes from 8.8.8.8#53 in 20 ms
So:
- Trying to resolve google.com.nope.thisdoesnotresolve.info correctly returns an NXDOMAIN.
- Trying to resolve google.com.kubec.maio.me incorrectly returns blank records (NOERROR with an empty answer section)…
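That difference would be enough to explain the traces if the resolver only moves on to the next search domain after an NXDOMAIN and treats any other reply as final. The Python sketch below simulates that rule. It is an assumption inferred from the strace output, not musl’s actual code, and search_lookup and fake_dns are hypothetical names with canned answers standing in for the real DNS server.

```python
NXDOMAIN, NOERROR = "NXDOMAIN", "NOERROR"


def search_lookup(name, search, answer_for):
    """Walk the search list, stopping at the first reply that is not NXDOMAIN."""
    for candidate in [f"{name}.{domain}" for domain in search] + [name]:
        rcode, addresses = answer_for(candidate)
        if rcode == NXDOMAIN:
            continue  # name does not exist: try the next candidate
        # Any other reply ends the search, even NOERROR with no addresses.
        return candidate, addresses
    return None, []


def fake_dns(name):
    """Canned replies mirroring the host output above (illustrative only)."""
    if name == "google.com":
        return NOERROR, ["173.194.67.102"]
    if name.endswith(".kubec.maio.me"):
        return NOERROR, []  # empty answer section instead of NXDOMAIN
    return NXDOMAIN, []


cluster = ["staging.svc.cluster.local", "svc.cluster.local", "cluster.local"]

print(search_lookup("google.com", cluster + ["nope.thisdoesnotresolve.info"], fake_dns))
# ('google.com', ['173.194.67.102'])
print(search_lookup("google.com", cluster + ["kubec.maio.me"], fake_dns))
# ('google.com.kubec.maio.me', [])
```

Under that rule, the empty NOERROR reply for the kubec.maio.me name ends the search with zero addresses, and the bare google.com is never queried.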
The problem seems to trace all the way up to Cloudflare, and I know for certain that I don’t have a record for *.kubec.maio.me… Which leaves only one possible cause I have yet to rule out: Cloudflare’s CNAME flattening. I have the zone apex set as a CNAME of kube-n01.kubec.maio.me, which doesn’t explain why Cloudflare is returning an empty record set instead of an NXDOMAIN (or even responding with CNAME kube-n01.kubec.maio.me), but this is the only reasonable conclusion I can come to.
With that, I leave you a haiku:
It appears that I am not the only one with this issue!
This is extremely frustrating – it seems that the only viable solutions are to either a) change node hostnames to something that cannot be derived into a search domain, or b) switch DNS providers. :(