Crying Cloud

View Original

WALinuxAgent issue in a Linux VM on Azure Stack

I was deploying some Linux RHEL systems yesterday on Azure Stack. Everything was working and then it wasn't.  The VM had deployed fine but the custom script extension had failed.  It takes 2 hours for the deployment to timeout. Deployment Failure Timeout 

Failed Script Extension

Failed Extensions Details

If you cancel the deployment and attempt to delete the resource group you will have to wait for the custom extension to timeout anyway, which takes two hours.  Microsoft recommends against canceling deployments and let them timeout, especially on Linux.  On the next deployment attempt, I got the same results.  On closer inspection of the stdout for custom script extension 2.0.6 (/var/lib/waagent/custom-script/download/0/stdout), I found this error.  warning: %postun(WALinuxAgent-2.2.18-1.el7.noarch) scriptlet failed, signal 15

Stdout log

Reviewing the bash script, the first line that was running was

yum -y clean all && yum -y clean metadata && yum -y clean dbcache && yum -y makecache && yum update -yvq

This appears to be a known issue, here is something someone has written with some information trying to explain what is happening https://github.com/Azure/WALinuxAgent/issues/178

I believe this explains the timeout.  The Agent had stopped and Azure Stack can no longer talk to the agent in the VM.  If you experience this specific issue adding an exclusion should leave the existing binaries alone.  If you don't have this specific error but are still experiencing a timeout it is probably worth checking the WALinuxAgent is still running.

yum update -y --exclude=WALinuxAgent