Troubleshooting (AEN 4.2.1)#
This guide describes how to resolve issues that can occur with your AEN installation.
- General troubleshooting steps
- Browser error: too many redirects
- Error: unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file
- Error: “Data Center Not Found” when deleting a project
- Forgotten administrator password
- Log files being deleted
- Error: This socket is closed
- Service error 502: Cannot connect to the application manager
- 502 communication error on Amazon Web Services (AWS)
- Invalid username
- Notebook Error: Cannot download notebook as PDF via LaTeX
- Unresponsive wk-server thread without error messages
- Unresponsive wk-gateway thread without error messages
General troubleshooting steps¶
- Clear browser cookies. When you change the AEN configuration or upgrade AEN, cookies remaining in the browser can cause issues. Clearing cookies and logging in again can help to resolve problems.
- Make sure NGINX and MongoDB are running.
- Make sure that AEN services are set to start at boot, on all nodes.
- Make sure that services are running as expected. If any services are not running or are missing, restart them.
- Check for and remove extraneous processes.
- Check the connectivity between nodes.
- Check the configuration file syntax.
- Check file ownership.
- Verify that POSIX ACLs are enabled.
Browser error: too many redirects¶
Cause¶
Browser cookies are out of date.
Solution¶
- Log out.
- Clear the browser’s cookies.
- Clear the browser cache.
- Log in.
Error: unix:////opt/wakari/wakari-server/etc/supervisor.sock no such file¶
This is a supervisorctl error.
Cause¶
supervisord is not running on the Server.
Solution¶
Ensure that supervisord is included in the crontab. Then restart supervisord manually.
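The checks in this section can be sketched as a short shell script. This is a minimal sketch, not an official AEN tool: the socket path is taken from the error message, while the helper name socket_present and the use of pgrep and crontab -l are assumptions about a typical Linux host.

```shell
#!/bin/sh
# Socket path taken from the supervisorctl error message.
SOCK=/opt/wakari/wakari-server/etc/supervisor.sock

# Hypothetical helper: succeeds only if the path exists and is a UNIX socket.
socket_present() {
    [ -S "$1" ]
}

if socket_present "$SOCK"; then
    echo "supervisor socket found: $SOCK"
else
    echo "supervisor socket missing: $SOCK"
fi

# Assumption: pgrep is installed; -x matches the exact process name.
if pgrep -x supervisord >/dev/null 2>&1; then
    echo "supervisord is running"
else
    echo "supervisord is not running"
fi

# Assumption: the crontab entry mentions supervisord by name.
if crontab -l 2>/dev/null | grep -q supervisord; then
    echo "crontab contains a supervisord entry"
else
    echo "no supervisord entry in this user's crontab"
fi
```

Run the script as the user that owns the AEN server processes, because crontab -l only lists the current user's entries.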
Error: “Data Center Not Found” when deleting a project¶
Cause¶
The data center has been removed.
Solution¶
As root, run:
/opt/wakari/wakari-server/bin/wk-server-admin remove-project --db-only <user> <project>
Forgotten administrator password¶
Use ssh to log in to the server as root.
Run:
/opt/wakari/wakari-server/bin/wk-server-admin reset-password -u SOME_USER -p SOME_PASSWORD
NOTE: Replace SOME_USER with the administrator username and SOME_PASSWORD with the password.
Log in to AEN as the administrator user with the new password.
Alternatively, you can add a new administrator user:
Use ssh to log in to the server as root.
Run:
/opt/wakari/wakari-server/bin/wk-server-admin add-user SOME_USER --admin -p SOME_PASSWORD -e YOUR_EMAIL
NOTE: Replace SOME_USER with the username, replace SOME_PASSWORD with the password, and replace YOUR_EMAIL with your email address.
Log in to AEN as the administrator user with the new password.
Log files being deleted¶
Log files are being deleted.
NOTE: Locations of AEN log files for each process and application are shown in the node sections in Concepts.
Cause¶
AEN installers write their logs to /tmp/wakari_{server,gateway,compute}.log. If the log files grow too large, they might be deleted.
Solution¶
Jupyter Notebook controls how verbose its logs are with the Application.log_level setting.
To make the logs less verbose than the default, but still informative, set Application.log_level to ERROR.
Error: This socket is closed¶
You receive the “This socket is closed” error message when you try to start an application.
Cause¶
When the supervisord process is killed, information sent to the standard output (stdout) and the standard error (stderr) is held in a pipe that will eventually fill up.
Once full, attempting to start any application will cause the “This socket is closed” error.
Solution¶
To prevent this issue:
- Follow the instructions in Managing services to stop and restart processes.
- Do not stop or kill supervisord without first stopping wk-compute and any other processes that use it.
To resolve the “This socket is closed” error:
Stop wk-compute by running the following command against its process ID:
sudo kill -9
Restart the supervisord and wk-compute processes:
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
Service error 502: Cannot connect to the application manager¶
The gateway node displays “Service Error 502: Can not connect to the application manager.”
Cause¶
A compute node is not responding because the wk-compute process has stopped.
Solution¶
Stop and then restart the supervisord and wk-compute processes:
sudo /etc/init.d/wakari-compute stop
sudo /etc/init.d/wakari-compute start
502 communication error on Amazon Web Services (AWS)¶
You receive the “502 Communication Error: This gateway could not communicate with the Wakari server” error message.
Cause¶
An AEN gateway cannot communicate with the Wakari server on AWS. There may be an issue with the IP address of the Wakari server.
Solution¶
Configure your AEN gateway to use the DNS hostname of the server. On AWS this is the DNS hostname of the Amazon Elastic Compute Cloud (EC2) instance.
Invalid username¶
Cause¶
The username breaks one or more of these rules:
- Must be at least 3 characters and no more than 25 characters.
- The first character must be a letter (A-Z) or a digit (0-9).
- Other characters can be a letter, digit, period (.), underscore (_) or hyphen (-).
- The POSIX standard specifies that these characters are the portable filename character set, and that portable usernames have the same character set.
Solution¶
Follow the above rules for usernames.
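To check a candidate username before creating the account, the rules above can be expressed as a small shell function. This is a minimal sketch: the function name is_valid_username is hypothetical and not part of AEN, and the character ranges assume an ASCII (C) locale.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 if the name satisfies the rules listed above.
is_valid_username() {
    uname_candidate=$1
    uname_len=${#uname_candidate}

    # Rule 1: at least 3 and no more than 25 characters.
    [ "$uname_len" -ge 3 ] && [ "$uname_len" -le 25 ] || return 1

    case $uname_candidate in
        # Rule 2: the first character must be a letter or a digit.
        [!A-Za-z0-9]*) return 1 ;;
        # Rule 3: only letters, digits, period, underscore, or hyphen anywhere.
        *[!A-Za-z0-9._-]*) return 1 ;;
    esac
    return 0
}

# Example usage with illustrative names.
for candidate in alice a.b_c-1 ab .hidden "al ice"; do
    if is_valid_username "$candidate"; then
        echo "valid:   $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```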
Notebook Error: Cannot download notebook as PDF via LaTeX¶
Cause¶
LaTeX is not properly installed.
CentOS/6 Solution¶
Install TeXLive from the TUG site. Follow the described steps. The installation may take some time.
Add the installation to the PATH in the file /etc/profile.d/latex.sh. Add the following line, replacing the year and architecture as needed:
PATH=/usr/local/texlive/2017/bin/x86_64-linux:$PATH
Restart the compute node.
CentOS/7 Solution¶
Install the missing packages by running:
yum install texlive texlive-xetex texlive-xetexconfig texlive-xetex-def texlive-adjustbox texlive-upquote texlive-ulem
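On either CentOS version, a quick way to confirm the installation is to check that the LaTeX engine is on the PATH of the compute node. This is a minimal sketch, assuming that the PDF export calls xelatex (which matches the texlive-xetex packages listed above). The helper name have_command is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the notebook PDF export invokes xelatex.
if have_command xelatex; then
    echo "xelatex found at: $(command -v xelatex)"
else
    echo "xelatex not found on PATH; check the LaTeX installation"
fi
```

Run the check in a new login shell so that any changes under /etc/profile.d have been picked up.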
Unresponsive wk-server thread without error messages¶
Cause¶
Two things can cause the wk-server thread to freeze without error messages:
- LDAP freezing
- MongoDB freezing
If LDAP or MongoDB is configured with a long timeout, Gunicorn can time out first and kill the LDAP or MongoDB process. That process then dies without logging a timeout error.
Solution¶
- Check for frozen LDAP or MongoDB server processes.
- You may also want to increase the Gunicorn timeout to more than 30 seconds.
Unresponsive wk-gateway thread without error messages¶
Cause¶
If TLS is configured with a passphrase-protected private key, wk-gateway will freeze without any error messages.
Solution¶
Update the TLS configuration so that it does not use a passphrase-protected private key.
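To find out whether a PEM private key is passphrase protected, you can look for the encryption markers that OpenSSL writes into the file. This is a minimal sketch: the helper name key_is_encrypted and the key path are hypothetical, and the check assumes a PEM-encoded key.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the PEM file carries an encryption marker.
# PKCS#8 keys use "BEGIN ENCRYPTED PRIVATE KEY"; traditional OpenSSL keys
# use a "Proc-Type: 4,ENCRYPTED" header line.
key_is_encrypted() {
    grep -Eq 'BEGIN ENCRYPTED PRIVATE KEY|Proc-Type: 4,ENCRYPTED' "$1"
}

# Placeholder path; replace with the key referenced by your TLS configuration.
KEY=${KEY:-/path/to/private.key}

if [ ! -r "$KEY" ]; then
    echo "cannot read $KEY"
elif key_is_encrypted "$KEY"; then
    echo "$KEY is passphrase protected"
    # For an RSA key, OpenSSL can write an unencrypted copy
    # (it prompts for the passphrase); the output path is a placeholder:
    #   openssl rsa -in "$KEY" -out /path/to/private-nopass.key
else
    echo "$KEY has no encryption marker"
fi
```

Keep the unencrypted key readable only by the account that runs wk-gateway, since it is no longer protected by a passphrase.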