Error: Bonita server not responding after an inactivity period

unai.gaston.caminos's picture
unai.gaston.caminos
Blog Categories: 

Component: Engine, Tomcat server | Version: 7.9+

Problem

You have set up your Bonita so it connects to your database server that is located in a private network protected by a firewall.

Everything works fine: you are able to connect to the Bonita portal; you can check your cases, you can execute your tasks; your processes move forward normally.

But then, if your platform remains inactive for some time (~1 hour, for example), when you get back, the server is not responding any more.

In your bonita.yyyy-mm-dd.log file, you can see this type of error:

// ---> first DB transaction after inactivity period (e.g. triggered by login API call)
2020-04-29 18:46:20.335 +0000 FINEST (ajp-nio-8009-exec-5) org.bonitasoft.engine.api.impl.ServerAPIImpl Starting Server API call org.bonitasoft.engine.api.LoginAPI login
2020-04-29 18:46:20.336 +0000 FINEST (ajp-nio-8009-exec-5) org.bonitasoft.engine.transaction.JTATransactionServiceImpl THREAD_ID=80 | HOSTNAME=SRV-WASBONITA02 | Beginning transaction in thread 80 null
...
...
// ---> XA transaction timeout error
2020-04-29 18:49:20.337 +0000 WARN (Transaction Reaper) com.arjuna.ats.arjuna ARJUNA012117: TransactionReaper::check timeout for TX 0:ffff0a01067b:82ff:5ea9a515:207a5 in state RUN
2020-04-29 18:49:20.339 +0000 WARN (Transaction Reaper Worker 0) com.arjuna.ats.arjuna ARJUNA012095: Abort of action id 0:ffff0a01067b:82ff:5ea9a515:207a5 invoked while multiple threads active within it.
2020-04-29 18:49:20.342 +0000 WARN (Transaction Reaper Worker 0) com.arjuna.ats.arjuna ARJUNA012381: Action id 0:ffff0a01067b:82ff:5ea9a515:207a5 completed with multiple threads - thread ajp-nio-8009-exec-5 was in progress with
java.base@11.0.7/java.net.SocketInputStream.socketRead0(Native Method)
java.base@11.0.7/java.net.SocketInputStream.socketRead(SocketInputStream.java:115)
java.base@11.0.7/java.net.SocketInputStream.read(SocketInputStream.java:168)
java.base@11.0.7/java.net.SocketInputStream.read(SocketInputStream.java:140)
...
org.postgresql.jdbc.PgPreparedStatement.executeQuery(PgPreparedStatement.java:106)
...
com.sun.proxy.$Proxy20.executeQuery(Unknown Source)
org.apache.tomcat.dbcp.dbcp2.PoolableConnection.validate(PoolableConnection.java:287)
org.apache.tomcat.dbcp.dbcp2.PoolableConnectionFactory.validateConnection(PoolableConnectionFactory.java:630)
...
org.apache.tomcat.dbcp.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:345)
...
org.apache.tomcat.dbcp.dbcp2.managed.ManagedDataSource.getConnection(ManagedDataSource.java:64)
org.apache.tomcat.dbcp.dbcp2.BasicDataSource.getConnection(BasicDataSource.java:794)
org.hibernate.engine.jdbc.connections.internal.DatasourceConnectionProviderImpl.getConnection(DatasourceConnectionProviderImpl.java:139)
...
org.bonitasoft.engine.platform.impl.PlatformServiceImpl.getDefaultTenant(PlatformServiceImpl.java:382)
...
org.bonitasoft.engine.transaction.JTATransactionServiceImpl.executeInTransaction(JTATransactionServiceImpl.java:286)
org.bonitasoft.engine.api.impl.LoginAPIImpl.getTenant(LoginAPIImpl.java:147)
org.bonitasoft.engine.api.impl.LoginAPIImpl.loginInternal(LoginAPIImpl.java:117)
org.bonitasoft.engine.api.impl.LoginAPIImpl.loginInternal(LoginAPIImpl.java:106)
com.bonitasoft.engine.api.impl.LoginAPIExt.login(LoginAPIExt.java:36)
...
org.keycloak.adapters.saml.servlet.SamlFilter.doFilter(SamlFilter.java:174)
org.bonitasoft.console.common.server.auth.impl.saml.BonitaSAML2Filter.processRequest(BonitaSAML2Filter.java:125)
org.bonitasoft.console.common.server.auth.impl.saml.BonitaSAML2Filter.doFilter(BonitaSAML2Filter.java:110)
...
java.base@11.0.7/java.lang.Thread.run(Thread.java:834) 

Why do you have this error?

The validation query executed before borrowing one of the connections in Tomcat's datasource's connection pool, is hanging.

As a result, the XA transaction is timing out. Note that In the above example, the transaction timeout has been set to 3 minutes (180 seconds) in the TOMCAT_HOME/server/conf/jbossts-properties.xml file:

<!-- (default is 60 seconds) -->
<entry key="com.arjuna.ats.arjuna.coordinator.defaultTimeout">180</entry>

(You may check this article for further information on the XA Transaction Timeout, even though raising its value will not help in the present case.)

What is happening?

Tomcat's datasource defines a minimum number of idle connections (minIdle) that must be kept in the pool at all time (where possible, that is, provided all maxTotal connections are not already active); for instance, in your TOMCAT_HOME/setup/tomcat-templates/bonita.xml:

    Resource name="bonitaDS"
              uniqueName="jdbc/bonitaDSXA"
              auth="Container"
              factory="org.jboss.narayana.tomcat.jta.TransactionalDataSourceFactory"
              transactionManager="TransactionManager"
              type="javax.sql.XADataSource"
              initialSize="8"
              maxTotal="50"
              minIdle="8"
              maxIdle="16"
              defaultAutoCommit="false"
              validationQuery="@@ds1.database_test_query@@"
              removeAbandonedOnBorrow="true"
              removeAbandonedOnMaintenance="true"
              logAbandoned="true"
              testOnBorrow="true"
              xaDataSource="RawBonitaDS"

This means that after a period of inactivity, at least minIdle connections to the database server will still remain open.

However, if the period of inactivity stretches beyond your database server's firewall's maximum TCP session duration (e.g. 3,600 seconds = 1 hour), the firewall will drop the connection.

Now, when you get back to your Bonita server after this inactivity period, Tomcat will try to borrow one of those idle connections to avoid having to establish a new one (as the latter can be relatively costly), so it will start by launching the validation query (defined in the validationQuery parameter) over the connection.

The problem comes from the fact that Tomcat's connection pool doesn't seem to cope well with this unexpected event:

  • it is unaware of the connection having been dropped by its peer, so the validation query hangs
  • the validation query timeout (validationQueryTimeout, if it were to be defined) is powerless: it doesn't manage to abort the hanging validation query

This is a known problem common to some connection pools, as the following article explains:

"(...) Various different pools provide such timeout mechanisms, but getting timeouts right under adverse network conditions is more difficult than it seems.

"(...)
"Dbcp2 did not return a connection, and also did not timeout. The execution of the validation query is stuck due to unacknowledged TCP traffic. Subsequently, the SQL Statement run on the (bad) connection by the test harness hangs (due to unacknowledged TCP). setMaxWait(5000) is seemingly useless for handling network outages. There are no other meaningful timeout settings that apply to a network failure.
"(...)
"Dbcp2: (D) - Seemingly hangs until the OS TCP timeout limit.

(...)"

You may read further on this glitch in the following articles, for example: here, and there.

Solution

Basically, there are two ways to get over this error:

  • Option A (discouraged): by raising the firewall's maximum TCP session duration (or similar) value. But then again, it may be difficult to estimate what the longest inactivity period of your Bonita server might be,so it will be hard to tell what value would be enough. Furthermore, there may be other applications besides the Bonita server accessing the same database server who would inevitably be affected by this option.
  • Option B (recommended): by adjusting Tomcat's connection pool's parameters so all the idle connections are renewed every so often.

Sticking to Option B, then, these are the parameters you would need to set in Tomcat's datasources in your TOMCAT_HOME/setup/tomcat-templates/bonita.xml:
(provided you start your server with the script start-bonita.sh/.bat; else, you would need to directly update the file TOMCAT_HOME/conf/Catalina/localhost/bonita.xml):

testWhileIdle="false" (default value)
timeBetweenEvictionRunsMillis=""
minEvictableIdleTimeMillis=""
softMinEvictableIdleTimeMillis="-1" (default value)


The goal is to ensure all idle connections are renewed after some time, even if that means going below the minIdle value. Thus, this is is the reasoning behind those values:

  • testWhileIdle="false" ---> so the decision whether to evict the idle connection or not does not depend on the result of the validation query
  • softMinEvictableIdleTimeMillis="-1" ---> so the eviction of the idle connection is not conditioned by the value of minIdle, and all connections may be closed regardless (anyway, the connection pool will automatically create as many new connections as necessary to grant the minIdle value)
  • timeBetweenEvictionRunsMillis and minEvictableIdleTimeMillis > 0 ---> so there is an evictor thread running
    NOTE: please be careful when choosing the values for these two parameters, as they will define the time an idle connection will remain open; you don't want that time to be too short so the idle connections stay open for some time, since having to create new connections too often may have a slight impact on Bonita's performance (establishing a new connection can be relatively costly).

For further information on Tomcat datasource configuration parameters, please refer to the Apache Commons DBCP documentation.

Notifications