Prequisites
Software required on machine: - git + git-lfs - Docker
You can use the script with SUDO
rights, located at /scripts/installation_utils/install_docker_and_utils.sh
, it can be used on Debian/Ubuntu/CentOS/RedHAT RHEL 8 only, run it once and everything should be set up.
Consult the (Docker installation steps
)[https://docs.docker.com/engine/install/debian/] if there are issues with the docker setup.
IMPORTANT NOTE: Do a git-lfs pull
so that you have everything downloaded from the repo (including bigger zipped files.).
Deployment
./deploy contains an example deployment of the customised NiFi image with related services for document processing, NLP and text analytics.
The key files are:
services.yml
- defines all the available services in docker-compose format. K8s (i.e. multi container service deployments is coming soon…)Makefile
- scripts for running docker-compose commands,.env
- local environment variables definitions, deployment.env
files are located in the/deploy
folder, security.env
files are located in the/security
folder, containing users and certificate generation settings. The above mentioned files should be the files that you will most likely need to change during a deployment.
Individual service configurations are provided in ./services
.
Apache NiFi-related files are provided in ./nifi
directory.
Environment variables
As mentioned above, environment variables have been made available after release 1.0.
The variables are configurable, and are separated, into security and general env vars, furthermore, all services declared in the services.yml
file have their variables in separate files.
In most cases, modifying these variables should be the only thing that is needed in order to run a successful deployment.
Multiple files are available, split into two categories:
service: located in
./deploy/
are reponsible for direct service configurationsecurity: located in
./security
, ceriticate related settings are always in the files starting withcertificates_
and user settings are located in the files ending with_users
The variables declared in the ./deploy
folder are used in multiple config files, as follows:
elasticsearch.env
, variables here are used in :./services/elasticsearch/config/(opensearch|elasticsearch).yml
./services/kibana/config/(opensearch|elasticsearch).yml
./services/metricbeat/metricbeat.yml
./deploy/services.yml
in the following sections:nifi
,elasticsearch-1
,elasticsearch-1
,elasticsearch-3
,kibana
,metricbeat-1
,metricbeat-2
nifi.env
, vars used in:./deploy/services.yml
, sections:nifi
./nifi/conf/nifi.properties
jupyter.env
, vars used in:./deploy/services.yml
, sections:jupyter
nlp_service.env
, vars used in:./deploy/services.yml
, sections:nlp-medcat-service-production
database.env
, vars used in:./deploy/services.yml
, sections:cogstack-databank-db
,samples-db
general.env
, these vars are optional, declared any custom variables you want here, used in thenifi
section
Additional variablesenv files, used only or certificate generation and user accounts, found in ./security
:
certificates_elasticsearch.env
, used increate_opensearch_*
/create_es_native*
scriptscertificates_general.env
, used increate_root_ca.sh
certificates_nifi.env
, used innifi_toolkit_security.sh
database_users.env
elasticsearch_users.env
nginx_users.env
Customization
For custom deployments, copy all the .env
files (which are not tracked by Git) and add deployment specific configurations to these files. For example:
cp deploy/*.env deploy/new_deploy_folder/
cp security/*.env deploy/new_deploy_folder/
Multiple deployments on the same machine
When deploying multiple docker-compose projects on the same machine (e.g. for dev or testing), it can be useful to remove all containers, volume and network names from the docker-compose file, and let Docker create names based on COMPOSE_PROJECT_NAME
in deploy/.env
. Docker will automatically create a Docker network and makes sure that containers can find each other by container name.
For example, when setting COMPOSE_PROJECT_NAME=cogstack-prod
, Docker Compose will create a container named cogstack-prod_elasticsearch-1_1
for the elasticsearch-1
service. Within the NiFi container, which is running in the same Docker network, you can refer to that container using just the service name elasticsearch-1
.
Important security detail
Please note that in the example service defintions, for ease of deployment and demonstration, SSL encryption is enabled among services (NiFi, ES, etc.), however, the certificates that are used are in this public repository, anyone can see them, so please make sure to re-generate them when you go into production.
Services
Please note that all the services are deployed using Docker engine and requires docker deamon to be running / functioning.
Please see the available services for more details.
Workflows
Apache NiFi provides users the ability to build very large and complex data flows. These data flows can be later saved as workflow templates, exported into XML format and shared with other users. We provide few example templates for ingesting the records from a database into Elasticsearch and to perform extraction of NLP annotations from documents.
Deployment using Makefile
For deployments based on the example workflows, please see example workflows for more details.
Deployment using a custom Docker-compose
When using a fork of this repository for a customized deployments, it can be useful to copy services.yml
to a deployment-specific docker-compose.yml
. In this Compose file you can specify the services you need for your instance and configure all parameters per service, as well as track this file in a branch in your own fork. This way you can use your own version control and rebase on CogStack/CogStack-NiFi
master without running into merge conflicts.
Troubleshooting
Always start with fresh containers and volumes, to make sure that there are no volumes from previous experimentations, make sure to always delete all/any cogstack running containers by executing:
docker container rm samples-db elasticsearch-1 kibana nifi nlp-medcat-service-production tika-service nlp-gate-drugapp nlp-medcat-snomed nlp-gate-bioyodie medcat-trainer-ui medcat-trainer-nginx jupyter-hub -f
followed by a cleanup or dangling volumes (careful as this will remove all volumes which are NOT being used by a container, if you want to remove specific volumes you will have to manually specifiy the volume names), otherwise, you can specify :
docker volume prune -f
WARNING THIS WILL DELETE ALL UNUSED VOLUMES ON YOUR MACHINE!. Check the volume names used in services.yml file and delete them as necessary dockr volume rm volume_name
Known Issues/errors
Common issues that can be encountered across services.
NiFi
When dealing with contaminated deployments ( containers using volumes from previous instances ) :
- NiFi only supports one mode of HTTP or HTTPS operation...
deleting the volumes should usually solve this issue, if not, please check the nifi.properties
if there have been modifications done by yourself or a developer on it.
- building the NiFi image manually on a restricted system, this is usually not necessary, but if for some reason this needs to be done then some settings such as proxy configs might need to be set up in the nifi/Dockerfile
epecially ones related to the grape
application and dealing with external downloads.
- keystore.jks
/truststore.jks
related errors, remove the nifi container & related volumes then restart the nifi instance.
- System Error: Invalid host header : this occurs when nifi host has not been properly configured
, please check the /nifi/conf/nifi.properties
file and set the nifi.web.proxy.host
property to the IP address of the server along with the port <host>:<port>
, if this does not work then it is usually a proxy/network configuration problem (also check firewalls), another workaround would be to comment out the following subsections of the nifi
service in the services.yml
file : ports:
and networks
with all their child settings. After this is done the following property should be added network_mode: host
, restart the instance using the docker-compoes -f services.yml up -d nifi
command afterwards.
- Possible error when dealing with non-pgsql databases due to Incorrect syntax near 'LIMIT'.; routing to failure: com.microsoft.sqlserver.jdbc.SQLServerException: Incorrect syntax near 'LIMIT'
, go to the GenerateTableFetch Process -> right-click -> configure -> change database type from Generic to -> MS SQL 2012 + or 2008 (if an older DB system is used)
- Possible error on Linux systems related to nifi.properties
permission error and/or other files from the nifi/conf/
folder, please see the nifi doc {nifi.properties} section.
- Driver class org.postgresql.Driver is not found
or something similar for other MSSQL/SQL drivers, this is a known issue after NiFi version v1.20+, first, make sure you pull the latest version of the repository, then for the JAR file you are using, please execute the following command in order to verify its integrity jar -tvf ./nifi/drivers/your_file_version.jar
, if this returns a list of files and NO errors then the files are not corrupted and can be loaded. On the NiFi side make sure to go to the DBCPConnectionPool
controller service and verify the propertiesit a few times, make sure the file path is correct and in the following format: file:///opt/nifi/drivers/postgresql-42.6.0.jar
for example. If all this fails stop nifi, delete all the Docker volumes associated with it -> restart NiFi, perform the above steps again. You can try forcefully starting the GenerateTableFetch
or QueryDatabaseTable
processors by enabling the DBCPConnectionPool
even if an error popus up after clicking the verify button.
- 502 Bad Gateway
, NiFi simply not starting, even after waiting more than 2-3 minutes. This can occur due to a wide variety of issues, you can check the NiFi container log : “docker logs -f –tail 1000 cogstack-nifi > my_log_file.txt” to capture the output easily. The most common cause is running out of memory, increase or decrease the limits in nifi/conf/bootstrap.conf
according to your machine’s spec, please read bootstrap.conf
Elasticsearch Errors
VM memory errors, failed bootstrap check
It is quite a common issue for both opensearch and native-ES to error out when it comes to virtual memory allocation, this error typically comes in the form of :
ERROR: [1] bootstrap checks failed
[1]: max virtual memory areas vm.max_map_count [65111] is too low, increase to at least [262144]
To solve this one needs to simply execute :
- on Linux/Mac OS X :
sysctl -w vm.max_map_count=262144
in terminal.
To make the same change systemwide plase add vm.max_map_count=262144
to /etc/sysctl.conf and restart the dockerservice/machine.
An example of this can be found under /services/elasticsearch/sysctl.conf
- on Windows you need to enter the following commands in a powershell instance:
wsl -d docker-desktop
sysctl -w vm.max_map_count=262144
For more on this issue please read: https://www.elastic.co/guide/en/elasticsearch/reference/current/vm-max-map-count.html
OpenSearch: validating opensearch.yml hosts
FATAL Error: [config validation of [opensearch].hosts]: types that failed validation:
- [config validation of [opensearch].hosts.0]: expected URI with scheme [http|https].
- [config validation of [opensearch].hosts.1]: could not parse array value from json input
This issue may appear after the recent switch to using fully customizable environment variables. Strings and ENV vars may be parsed differently depending on the shell version found on the host system.
To solve this, the easiest way is to make sure to load the elasticsearch.env
variables before starting the Elastic & Kibana containers by doing the following:
cd ./deploy/
set -a
source elasticsearch.env
make start-elastic
Alternatively (if the script executes without issues):
cd ./deploy/
source export_env_vars.sh
make start-elastic
DB-samples issues
No table data for samples_db
It is possible that you may have forgotten to pull the large files from the repo, please do : git lfs pull
.
Delete the samples-db container and it’s volumes and restart it, you should now see the data in the tables.