Thursday, October 6, 2011

Troubleshooting: Coherence Log Messages

Configuration Log Messages

Message java.io.IOException: Configuration file is missing: "tangosol-coherence.xml"
Parameters n/a
Severity 1-Error
Cause The operational configuration descriptor cannot be loaded.
Action Make sure that the "tangosol-coherence.xml" resource can be loaded from the class path specified in the Java command line.
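As a rough sketch (assuming the standard 3.x system property tangosol.coherence.override and a hypothetical override path), the operational descriptor is normally loaded from coherence.jar on the class path, and an override can be pointed to explicitly on the command line or programmatically before the cluster starts:

    // Typical launch: coherence.jar (which contains tangosol-coherence.xml) is on the class path,
    // and an optional override file is pointed to explicitly:
    //   java -cp coherence.jar:myapp.jar -Dtangosol.coherence.override=/opt/conf/tangosol-coherence-override.xml com.myapp.Main
    import com.tangosol.net.CacheFactory;

    public class StartNode {
        public static void main(String[] args) {
            // Hypothetical path; it must resolve to a readable file or class-path resource.
            System.setProperty("tangosol.coherence.override",
                               "/opt/conf/tangosol-coherence-override.xml");
            // On success this logs "Loaded operational configuration from ..." (and the override message).
            CacheFactory.ensureCluster();
        }
    }
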
Message Loaded operational configuration from resource "%s"
Parameters %s - the full resource path (URI) of the operational configuration descriptor
Severity 3-Informational
Cause The operational configuration descriptor is loaded by Coherence from the specified location.
Action If the location of the operational configuration descriptor was explicitly specified via system properties or programmatically, verify that the reported URI matches the expected location.
Message Loaded operational overrides from "%s"
Parameters %s - the URI (file or resource) of the operational configuration descriptor override
Severity 3-Informational
Cause The operational configuration descriptor points to an override location, from which the descriptor override has been loaded.
Action If the location of the operational configuration descriptor was explicitly specified via system properties, descriptor override or programmatically, verify that the reported URI matches the expected location.
Message Optional configuration override "%s" is not specified
Parameters %s - the URI of the operational configuration descriptor override
Severity 3-Informational
Cause The operational configuration descriptor points to an override location which does not contain any resource.
Action Verify that the operational configuration descriptor override is not supposed to exist.

Message java.io.IOException: Document "%s1" is cyclically referenced by the 'xml-override' attribute of element %s2
Parameters %s1 - the URI of the operational configuration descriptor or override; %s2 - the name of the XML element that contains an incorrect reference URI
Severity 1-Error
Cause The operational configuration override points to itself, or to another override that points back to it, creating an infinite recursion.
Action Correct the invalid 'xml-override' attribute's value.
Message java.io.IOException: Exception occurred during parsing: %s
Parameters %s - the XML parser error
Severity 1-Error
Cause The specified XML is invalid and cannot be parsed.
Action Correct the XML document.
Message Loaded cache configuration from "%s"
Parameters %s - the URI (file or resource) of the cache configuration descriptor
Severity 3-Informational
Cause The operational configuration descriptor or a programmatically created ConfigurableCacheFactory instance points to a cache configuration descriptor that has been loaded.
Action Verify that the reported URI matches the expected cache configuration descriptor location
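For the cache configuration descriptor, a minimal sketch (the property name follows the 3.x convention tangosol.coherence.cacheconfig; the file path and cache name are hypothetical) of controlling and verifying which descriptor is loaded:

    // Point the node at an explicit cache configuration file:
    //   java -Dtangosol.coherence.cacheconfig=/opt/conf/my-cache-config.xml ...
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class CacheConfigCheck {
        public static void main(String[] args) {
            // The first cache access triggers loading of the cache configuration descriptor
            // and produces the "Loaded cache configuration from ..." log message.
            NamedCache cache = CacheFactory.getCache("example-cache");
            System.out.println("Service: " + cache.getCacheService().getInfo().getServiceName());
        }
    }
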

Partitioned Cache Service Log Messages

Message Asking member %n1 for %n2 primary partitions
Parameters %n1 - the node id this node asks to transfer partitions from; %n2 - the number of partitions this node is willing to take
Severity 4-Debug Level 4
Cause When a storage-enabled partitioned service starts on a Coherence node, it first receives the configuration update that informs it about other storage-enabled service nodes and the current partition ownership information. That information allows it to calculate the "fair share" of partitions that each node is supposed to own at the end of the re-distribution process. This message demarcates the beginning of the transfer request to a specified node for a number of partitions, moving toward the "fair" ownership distribution.
Action None.
Message Transferring %n1 out of %n2 primary partitions to member %n3 requesting %n4
Parameters %n1 - the number of primary partitions this node is transferring to a requesting node; %n2 - the total number of primary partitions this node currently owns; %n3 - the node id that this transfer is for; %n4 - the number of partitions that the requesting node asked for
Severity 4-Debug Level 4
Cause During the partition distribution protocol, a node that owns less than a "fair share" of primary partitions requests any of the nodes that own more than the fair share to transfer a portion of owned partitions. The owner may choose to send any number of partitions less or equal to the requested amount. This message demarcates the beginning of the corresponding primary data transfer.
Action None.
Message Transferring %n1 out of %n2 partitions to a machine-safe backup 1 at member %n3 (under %n4)
Parameters %n1 - the number of backup partitions this node is transferring to a different node; %n2 - the total number of partitions this node currently owns that are "endangered" (do not have a backup); %n3 - the node id that this transfer is for; %n4 - the number of partitions that the transferee can take before reaching the "fair share" amount
Severity 4-Debug Level 4
Cause After the primary partition ownership is completed, nodes start distributing the backups, ensuring the "strong backup" policy, which places backup ownership on nodes running on machines different from the primary owners' machines. This message demarcates the beginning of the corresponding backup data transfer.
Action None.
Message Transferring backup[%n1] for partition %n2 from member %n3 to member %n4
Parameters %n1 - the index of the backup partition that this node is transferring to a different node; %n2 - the partition number that is being transferred; %n3 - the node id of the previous owner of this backup partition; %n4 - the node id that the backup partition is being transferred to.
Severity 5-Debug Level 5
Cause During the partition distribution protocol, a node that determines that a backup owner for one of its primary partitions is overloaded may choose to transfer the backup ownership to another, underloaded node. This message demarcates the beginning of the corresponding backup data transfer.
Action None.
Message Failed backup transfer for partition %n1 to member %n2; restoring owner from: %n2 to: %n3
Parameters %n1 - the partition number for which a backup transfer was in-progress; %n2 - the node id that the backup partition was being transferred to; %n3 - the node id of the previous backup owner of the partition
Severity 4-Debug Level 4
Cause This node was in the process of transferring a backup partition to a new backup owner when that node left the service. This node is restoring the backup ownership to the previous backup owner.
Action None.
Message Deferring the distribution due to %n1 pending configuration updates
Parameters %n1 - the number of pending configuration updates
Severity 5-Debug Level 5
Cause This node is in the process of updating the global ownership map (notifying other nodes about ownership changes) when the periodic scheduled distribution check takes place. Before the previous ownership changes (most likely due to a previously completed transfer) are finalized and acknowledged by the other service members, this node will postpone subsequent scheduled distribution checks.
Action None.
Message Limiting primary transfer to %n1 KB (%n2 partitions)
Parameters %n1 - the size in KB of the transfer that was limited; %n2 - the number of partitions that were transferred
Severity 4-Debug Level 4
Cause When a node receives a request for some number of primary partitions from an underloaded node, it may transfer any number of partitions (up to the requested amount) to the requestor. The size of the transfer is limited by the <transfer-threshold> configuration element. This message indicates that the distribution algorithm limited the transfer to the specified number of partitions due to the transfer-threshold.
Action None.
Message DistributionRequest was rejected because the receiver was busy. Next retry in %n1 ms
Parameters %n1 - the time in milliseconds before the next distribution check will be scheduled
Severity 6-Debug Level 6
Cause This (underloaded) node issued a distribution request to another node asking for one or more partitions to be transferred. However, the other node declined to initiate the transfer as it was in the process of completing a previous transfer with a different node. This node will wait at least the specified amount of time (to allow time for the previous transfer to complete) before the next distribution check.
Action None.

Message Restored from backup %n1 partitions
Parameters %n1 - the number of partitions being restored
Severity 3-Informational
Cause The primary owner for some backup partitions owned by this node has left the service. This node is restoring those partitions from backup storage (assuming primary ownership). This message is followed by a list of the partitions that are being restored.
Action None.
Message Re-publishing the ownership for partition %n1 (%n2)
Parameters %n1 - the partition number whose ownership is being re-published; %n2 - the node id of the primary partition owner, or 0 if the partition is orphaned
Severity 4-Debug Level 4
Cause This node is in the process of transferring a partition to another node when a service membership change occurred, necessitating redistribution. This message indicates that this node is re-publishing the ownership information for the partition whose transfer is in-progress.
Action None.
Message %n1> Ownership conflict for partition %n2 with member %n3 (%n4!=%n5)
Parameters %n1 - the number of attempts made to resolve the ownership conflict; %n2 - the partition whose ownership is in dispute; %n3 - the node id of the service member in disagreement about the partition ownership; %n4 - the node id of the partition's primary owner in this node's ownership map; %n5 - the node id of the partition's primary owner in the other node's ownership map
Severity 4-Debug Level 4
Cause If a service membership change occurs while the partition ownership is in-flux, it is possible for the ownership to become transiently out-of-sync and require reconciliation. This message indicates that such a conflict was detected, and denotes the attempts to resolve it.
Action None.
Message Assigned %n1 orphaned primary partitions
Parameters %n1 - the number of orphaned primary partitions that were re-assigned
Severity 2-Warning
Cause This service member (the most senior storage-enabled) has detected that one or more partitions have no primary owner (orphaned), most likely due to several nodes leaving the service simultaneously. The remaining service members agree on the partition ownership, after which the storage-senior assigns the orphaned partitions to itself. This message is followed by a list of the assigned orphan partitions. This message indicates that data in the corresponding partitions may have been lost.
Action None.
Message validatePolls: This service timed-out due to unanswered handshake request. Manual intervention is required to stop the members that have not responded to this Poll
Parameters none
Severity 1-Error
Cause When a node joins a clustered service, it performs a handshake with each clustered node running the service. A missing handshake response prevents this node from joining the service. Most commonly, it is caused by an unresponsive (e.g. deadlocked) service thread.
Action Corrective action may require locating and shutting down the JVM running the unresponsive service. See Metalink Note 845363.1 for more details.
Message java.lang.RuntimeException: Storage is not configured
Parameters None
Severity 1-Error
Cause A cache request was made on a service that has no storage-enabled service members. Only storage-enabled service members may process cache requests, so there must be at least one storage-enabled member.
Action Check the configuration/deployment to ensure that members intended to store cache data are configured to be storage-enabled. This is controlled by the <local-storage> configuration element, or by the -Dtangosol.coherence.distributed.localstorage command-line override.
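A minimal sketch (the cache name is hypothetical; the property name follows the standard 3.x convention) of confirming that at least one storage-enabled member is running before issuing cache requests:

    // Enable storage on a cache-server JVM:
    //   java -Dtangosol.coherence.distributed.localstorage=true ...
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.CacheService;
    import com.tangosol.net.DistributedCacheService;
    import com.tangosol.net.NamedCache;

    public class StorageCheck {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("example-cache");
            CacheService service = cache.getCacheService();
            if (service instanceof DistributedCacheService) {
                int cStorage = ((DistributedCacheService) service).getStorageEnabledMembers().size();
                // If this is 0, any cache request on the service fails with "Storage is not configured".
                System.out.println("Storage-enabled members: " + cStorage);
            }
        }
    }
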
Message An entry was inserted into the backing map for the partitioned cache "%s" that is not owned by this member; the entry will be removed.
Parameters %s - the name of the cache into which the insert was attempted
Severity 1-Error
Cause The backing map for a partitioned cache may only contain keys that are owned by that member. Cache requests are routed to the service member owning the requested keys, ensuring that service members will only process requests for keys which they own. This message indicates that the backing map for a cache detected an insertion for a key which is not owned by the member. This is most likely caused by direct use of the backing-map, as opposed to the exposed cache APIs (e.g. NamedCache), in user code running on the cache server. This message is followed by a Java exception stack trace showing where the insertion was made.
Action Examine the user-code implicated by the stack-trace to ensure that any backing-map operations are safe. This error can be indicative of an incorrect key association implementation (e.g. KeyAssociation or KeyAssociator).
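As an illustration (cache and key names are hypothetical), server-side code should mutate data through the exposed NamedCache API, which routes the operation to the owning member, rather than touching the backing map directly:

    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;

    public class SafeServerSideUpdate {
        public static void updatePrice(String symbol, double price) {
            // The partitioned service routes this request to the member that owns the key,
            // so the ownership check on the backing map is never violated.
            NamedCache cache = CacheFactory.getCache("prices");
            cache.put(symbol, Double.valueOf(price));
            // Inserting the same entry directly into a backing map on a member that does not
            // own "symbol" is what triggers the error above.
        }
    }
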
Message Exception occurred during filter evaluation: %s; removing the filter...
Parameters %s - the description of the filter that failed during evaluation
Severity 1-Error
Cause An exception was thrown while evaluating a filter for a MapListener registered on this cache. As a result, some MapEvents may not have been issued. Additionally, to prevent further failures, the filter (and associated MapListener) will be removed. This message is followed by a Java exception stack trace showing where the failure occurred.
Action Review filter implementation and the associated stack trace for errors.
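A minimal sketch (the cache name and the assumption that cached values are Integer order amounts are hypothetical) of registering a filtered MapListener whose filter is written defensively, so that it cannot throw during evaluation:

    import java.io.Serializable;
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.Filter;
    import com.tangosol.util.MapEvent;
    import com.tangosol.util.MultiplexingMapListener;
    import com.tangosol.util.filter.MapEventFilter;

    public class OrderListenerRegistration {
        // A filter that never throws: an exception raised here would cause the
        // "removing the filter" error and silently drop the listener.
        public static class LargeOrderFilter implements Filter, Serializable {
            public boolean evaluate(Object o) {
                return (o instanceof Integer) && ((Integer) o).intValue() > 1000;
            }
        }

        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("orders");
            cache.addMapListener(new MultiplexingMapListener() {
                    protected void onMapEvent(MapEvent evt) {
                        System.out.println("Large order event: " + evt);
                    }
                },
                new MapEventFilter(MapEventFilter.E_INSERTED, new LargeOrderFilter()),
                false);
        }
    }
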
Message Exception occurred during event transformation: %s; removing the filter...
Parameters %s - the description of the filter that failed during event transformation
Severity 1-Error
Cause An Exception was thrown while the specified filter was transforming a MapEvent for a MapListener registered on this cache. As a result, some MapEvents may not have been issued. Additionally, to prevent further failures, the filter (and associated MapListener) will be removed. This message is followed by a Java exception stack trace showing where the failure occurred.
Action Review filter implementation and the associated stack trace for errors.
Message Exception occurred during index rebuild: %s
Parameters %s - the stack trace for the exception that occurred during index rebuild
Severity 1-Error
Cause An Exception was thrown while adding or rebuilding an index. A likely cause of this is a faulty ValueExtractor implementation. As a result of the failure, the associated index is removed. This message is followed by a Java exception stack trace showing where the failure occurred.
Action Review the ValueExtractor implementation and associated stack trace for errors.
Message Exception occurred during index update: %s
Parameters %s - the stack trace for the exception that occurred during index update
Severity 1-Error
Cause An Exception was thrown while updating an index. A likely cause of this is a faulty ValueExtractor implementation. As a result of the failure, the associated index is removed. This message is followed by a Java exception stack trace showing where the failure occurred.
Action Review the ValueExtractor implementation and associated stack trace for errors.
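A minimal sketch (the cache, class, and method names are hypothetical) of adding an index with a ValueExtractor that tolerates unexpected values instead of throwing, since an exception during extraction causes the index to be dropped:

    import java.io.Serializable;
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.ValueExtractor;

    public class IndexSetup {
        public static class Trade implements Serializable {
            private final String symbol;
            public Trade(String symbol) { this.symbol = symbol; }
            public String getSymbol()   { return symbol; }
        }

        // Returning null for unexpected values is safer than throwing.
        public static class SymbolExtractor implements ValueExtractor, Serializable {
            public Object extract(Object oTarget) {
                return (oTarget instanceof Trade) ? ((Trade) oTarget).getSymbol() : null;
            }
            // equals() and hashCode() should also be implemented so the index is
            // identified consistently across members (omitted for brevity).
        }

        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("trades");
            cache.addIndex(new SymbolExtractor(), /*fOrdered*/ false, /*comparator*/ null);
        }
    }
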
Message Exception occurred during query processing: %s
Parameters %s - the stack trace for the exception that occurred while processing a query
Severity 1-Error
Cause An Exception was thrown while processing a query. A likely cause of this is an error in the implementation of the Filter used by the query. This message is followed by a Java exception stack trace showing where the failure occurred.
Action Review the Filter implementation and associated stack trace for errors.
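As an illustration (the cache name and the getSymbol accessor are hypothetical), a query built from the standard filters; any custom Filter used this way should be equally defensive about unexpected data:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;
    import com.tangosol.net.CacheFactory;
    import com.tangosol.net.NamedCache;
    import com.tangosol.util.filter.EqualsFilter;

    public class TradeQuery {
        public static void main(String[] args) {
            NamedCache cache = CacheFactory.getCache("trades");
            // EqualsFilter extracts via the getSymbol() accessor on each cached value.
            Set entries = cache.entrySet(new EqualsFilter("getSymbol", "ORCL"));
            for (Iterator iter = entries.iterator(); iter.hasNext(); ) {
                Map.Entry entry = (Map.Entry) iter.next();
                System.out.println(entry.getKey() + " -> " + entry.getValue());
            }
        }
    }
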
Message BackingMapManager %s1: returned "null" for a cache: %s2
Parameters %s1 - the classname of the BackingMapManager implementation that returned a null backing-map; %s2 - the name of the cache for which the BackingMapManager returned null
Severity 1-Error
Cause A BackingMapManager returned null for a backing-map for the specified cache.
Action Review the specified BackingMapManager implementation for errors and to ensure that it will properly instantiate a backing map for the specified cache.
Message BackingMapManager %s1: failed to instantiate a cache: %s2
Parameters %s1 - the classname of the BackingMapManager implementation that failed to create a backing-map; %s2 - the name of the cache for which the BackingMapManager failed
Severity 1-Error
Cause A BackingMapManager unexpectedly threw an Exception while attempting to instantiate a backing-map for the specified cache.
Action Review the specified BackingMapManager implementation for errors and to ensure that it will properly instantiate a backing map for the specified cache.
Message BackingMapManager %s1: failed to release a cache: %s2
Parameters %s1 - the classname of the BackingMapManager implementation that failed to release a backing-map; %s2 - the name of the cache for which the BackingMapManager failed
Severity 1-Error
Cause A BackingMapManager unexpectedly threw an Exception while attempting to release a backing-map for the specified cache.
Action Review the specified BackingMapManager implementation for errors and to ensure that it will properly release a backing map for the specified cache.
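A minimal sketch (the class name is hypothetical, and such a manager would normally be wired in through the cache configuration) of a BackingMapManager that always returns a non-null map and releases it cleanly:

    import java.util.Map;
    import com.tangosol.net.AbstractBackingMapManager;
    import com.tangosol.net.cache.LocalCache;

    public class SimpleBackingMapManager extends AbstractBackingMapManager {
        public Map instantiateBackingMap(String sCacheName) {
            // Returning null here produces the 'returned "null" for a cache' error;
            // throwing here produces the 'failed to instantiate a cache' error.
            return new LocalCache();
        }

        public void releaseBackingMap(String sCacheName, Map map) {
            // Release any resources held by the backing map; throwing here
            // produces the 'failed to release a cache' error.
            if (map instanceof LocalCache) {
                ((LocalCache) map).clear();
            }
        }
    }
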
Message Unexpected event during backing map operation: key=%s1; expected=%s2; actual=%s3
Parameters %s1 - the key being modified by the cache; %s2 - the expected backing-map event from the cache operation in progress; %s3 - the actual MapEvent received
Severity 6-Debug Level 6
Cause While performing a cache operation, an unexpected MapEvent was received on the backing-map. This indicates that a concurrent operation was performed directly on the backing-map and is most likely caused by direct manipulation of the backing-map as opposed to the exposed cache APIs (e.g. NamedCache) in user code running on the cache server.
Action Examine any user-code that may directly modify the backing map to ensure that any backing-map operations are safe.
Message Application code running on "%s1" service thread(s) should not call %s2 as this may result in deadlock. The most common case is a CacheFactory call from a custom CacheStore implementation.
Parameters %s1 - the name of the service which has made a re-entrant call; %s2 - the name of the method on which a re-entrant call was made
Severity 2-Warning
Cause While executing application code on the specified service, a re-entrant call (a request to the same service) was made. Coherence does not support re-entrant service calls, so any application code (CacheStore, EntryProcessor, etc.) running on the service thread(s) should avoid making cache requests. See the Constraints on Re-entrant Calls for more details.
Action Remove re-entrant calls from application code running on the service thread(s) and consider using alternative design strategies as outlined in the Constraints on Re-entrant Calls documentation.
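A minimal sketch (the DAO is a hypothetical stand-in for whatever persistence layer is used) of a CacheStore that talks to its external store directly instead of making re-entrant CacheFactory calls from the service thread:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import com.tangosol.net.cache.AbstractCacheStore;

    public class OrderCacheStore extends AbstractCacheStore {
        private final OrderDao dao = new OrderDao();   // hypothetical data-access object

        public Object load(Object key) {
            // Go to the database (or other external store) directly; do NOT call
            // CacheFactory.getCache(...) for a cache on the same service here, since that
            // re-entrant call is what triggers the warning above and can deadlock.
            return dao.findById(key);
        }

        public void store(Object key, Object value) {
            dao.save(key, value);
        }

        // Hypothetical in-memory DAO, included only to keep the sketch self-contained.
        static class OrderDao {
            private final Map store = new ConcurrentHashMap();
            Object findById(Object key)         { return store.get(key); }
            void save(Object key, Object value) { store.put(key, value); }
        }
    }
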
Message Repeating %s1 for %n1 out of %n2 items due to re-distribution of %s2
Parameters %s1 - the description of the request that must be repeated; %n1 - the number of items that are outstanding due to re-distribution; %n2 - the total number of items requested; %s2 - the list of partitions that are in the process of re-distribution and for which the request must be repeated
Severity 5-Debug Level 5
Cause When a cache request is made, the request is sent to the service members owning the partitions to which the request refers. If one or more of the partitions that a request refers to is in the process of being transferred (e.g. due to re-distribution), the request is rejected by the (former) partition owner and is automatically resent to the new partition owner.
Action None.
Message Error while starting cluster: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
Parameters %s - information on the service that could not be started
Severity 1-Error
Cause When joining a service, every service in the cluster must respond to the join request. If one or more nodes have a service that does not respond within the timeout period, the join times out.
Action See Metalink Note 845363.1
Message Failed to restart services: com.tangosol.net.RequestTimeoutException: Timeout during service start: ServiceInfo(%s)
Parameters %s - information on the service that could not be started
Severity 1-Error
Cause When joining a service, every service in the cluster must respond to the join request. If one or more nodes have a service that does not respond within the timeout period, the join times out.
Action See Metalink Note 845363.1

TCMP Log Messages


Message
Experienced a %n1 ms communication delay (probable remote GC) with Member %s
Parameters
%n1 - the latency in milliseconds of the communication delay; %s the full Member information
Severity
2-Warning or 5-Debug Level 5 or 6-Debug Level 6 depending on the length of the delay
Cause
This node detected a delay in receiving acknowledgment packets from the specified node, and has determined that it is likely due to a remote GC (rather than a local GC). This message indicates that the overdue acknowledgment has been received from the specified node, and that it has likely emerged from its GC.
Action
Prolonged and frequent GC's can adversely affect cluster performance and availability. If these warnings are seen frequently, review your JVM heap and GC configuration and tuning. See the performance tuning guide for more details.

Message
Failed to satisfy the variance: allowed=%n1 actual=%n2
Parameters
%n1 - the maximum allowed latency in milliseconds; %n2 - the actual latency in milliseconds
Severity
3-Informational or 5-Debug Level 5 depending on the message frequency
Cause
One of the first steps in the Coherence cluster discovery protocol is the calculation of the clock difference between the new and the senior nodes. This step assumes a relatively small latency for peer-to-peer round trip UDP communications between the nodes. By default, the configured maximum allowed latency (the value of the "maximum-time-variance" configuration element) is 16 milliseconds. Failure to satisfy that latency causes this message to be logged and increases the latency threshold, which will be reflected in a follow up message.
Action
If the latency consistently stays very high (over 100 milliseconds), consult your network administrator and run the Datagram Test.

Message
Created a new cluster "%s1" with Member(%s2)
Parameters
%s1 - the cluster name; %s2 - the full Member information
Severity
3-Informational
Cause
This Coherence node attempted to join an existing cluster for the configured amount of time (specified by the "multicast-listener/join-timeout-milliseconds" element), but did not receive any responses from any other node. As a result, it created a new cluster with the specified name (either configured by the "member-identity/cluster-name" element or calculated based on the multicast listener address and port or the "well-known-address" list). The Member information includes the node id, creation timestamp, unicast address and port, location, process id, role, etc.
Action
None, if this node is expected to be the first node in the cluster. Otherwise, the operational configuration has to be reviewed to determine the reason that this node does not join the existing cluster.

Message
This Member(%s1) joined cluster "%s2" with senior Member(%s3)
Parameters
%s1 - the full Member information for this node; %s2 - the cluster name; %s3 - the full Member information for the cluster senior node
Severity
3-Informational
Cause
This Coherence node has joined an existing cluster.
Action
None, if this node is expected to join an existing cluster. Otherwise, identify the running cluster and consider corrective actions.

Message
Member(%s) joined Cluster with senior member %n
Parameters
%s - the full Member information for a new node that joined the cluster this node belongs to; %n - the node id of the cluster senior node
Severity
5-Debug Level 5
Cause
A new node has joined an existing Coherence cluster.
Action
None.

Message
Member(%s) left Cluster with senior member %n
Parameters
%s - the full Member information for a node that left the cluster; %n - the node id of the cluster senior node
Severity
5-Debug Level 5
Cause
A node has left the cluster. This departure could be caused by a programmatic shutdown, process termination (normal or abnormal), or any other communication failure (e.g. a network disconnect or a very long GC pause). This message reports the node's departure.
Action
None, if the node departure was intentional. Otherwise, the departed node logs should be analyzed.

Message
MemberLeft notification for Member %n received from Member(%s)
Parameters
%n - the node id of the departed node; %s - the full Member information for a node that left the cluster
Severity
5-Debug Level 5
Cause
When a Coherence node terminates, this departure is detected by some nodes earlier than others. Most commonly, a node connected via the TCP ring connection ("TCP ring buddy") would be the first to detect it. This message provides the information about the node that detected the departure first.
Action
None, if the node departure was intentional. Otherwise, the logs for both the departed and the detecting nodes should be analyzed.
Message
Service %s joined the cluster with senior service member %n
Parameters
%s - the service name; %n - the senior service member id
Severity
5-Debug Level 5
Cause
When a clustered service starts on a given node, Coherence initiates a handshake protocol between all cluster nodes running the specified service. This message serves as an indication that this protocol has been initiated. If the senior node is not known at this time, it will be shown as "n/a".
Action
None.

Message

This node appears to have partially lost the connectivity: it receives responses from MemberSet(%s1) which communicate with Member(%s2), but is not responding directly to this member; that could mean that either requests are not coming out or responses are not coming in; stopping cluster service.
Parameters
%s1 - set of members that can communicate with the member indicated in %s2; %s2 - member that can communicate with set of members indicated in %s1
Severity
1-Error
Cause
The communication link between this member and the member indicated by %s2 has been broken. However, the set of witnesses indicated by %s1 report no communication issues with %s2. It is therefore assumed that this node is in a state of partial failure, thus resulting in the shutdown of its cluster threads.
Action
Corrective action is not necessarily required, since the rest of the cluster presumably is continuing its operation and this node may recover and rejoin the cluster. On the other hand, it may warrant an investigation into root causes of the problem (especially if it is recurring with some frequency).
Message

validatePolls: This senior encountered an overdue poll, indicating a dead member, a significant network issue or an Operating System threading library bug (e.g. Linux NPTL): Poll
Parameters
none
Severity
2-Warning
Cause
When a node joins a cluster, it performs a handshake with each cluster node. A missing handshake response prevents this node from joining the service. The log message following this one will indicate the corrective action taken by this node.
Action
If this message reoccurs, further investigation into the root cause may be warranted.

Message

Received panic from senior Member(%s1) caused by Member(%s2)
Parameters
%s1 - the cluster senior member as known by this node; %s2 - a member claiming to be the senior member
Severity
1-Error
Cause
This occurs after a cluster is split into multiple cluster islands (usually due to a network link failure). When a link is restored and the corresponding island seniors see each other, the panic protocol is initiated to resolve the conflict.
Action
If this issue occurs frequently, the root cause of the cluster split should be investigated.

Message
Member %n1 joined Service %s with senior member %n2
Parameters
%n1 - an id of the Coherence node that joins the service; %s - the service name; %n2 - the senior node for the service
Severity
5-Debug Level 5
Cause
When a clustered service starts on any cluster node, Coherence initiates a handshake protocol between all cluster nodes running the specified service. This message serves as an indication that the specified node has successfully completed the handshake and joined the service.
Action
None.
Message

Member %n1 left Service %s with senior member %n2
Parameters
%n1 - an id of the Coherence node that left the service; %s - the service name; %n2 - the senior node for the service
Severity
5-Debug Level 5
Cause
When a clustered service terminates on some cluster node, all other nodes that run this service are notified about this event. This message serves as an indication that the specified clustered service at the specified node has terminated.
Action
None.

Message

Service %s: received ServiceConfigSync containing %n entries
Parameters
%s - the service name; %n - the number of entries in the service configuration map
Severity
5-Debug Level 5
Cause
As a part of the service handshake protocol between all cluster nodes running the specified service, the service senior member updates every new node with the full content of the service configuration map. For partitioned cache services, that map includes the full partition ownership catalog and internal ids for all existing caches. The same message is sent in the case of an abnormal service termination at the senior node, when a new node assumes the service seniority. This message serves as an indication that the specified node has received that configuration update.
Action
None.
Message

TcpRing: connecting to member %n using TcpSocket{%s}
Parameters
%s - the full information for the TcpSocket that serves as a TcpRing connector to another node; %n - the node id to which this node has connected
Severity
5-Debug Level 5
Cause
For quick process termination detection Coherence utilizes a feature called TcpRing, which is a sparse collection of TCP/IP-based connections between different nodes in the cluster. Each node in the cluster is connected to at least one other node, which (if at all possible) is running on a different physical box. This connection is not used for any data transfer; only trivial "heartbeat" communications are sent once a second per link. This message indicates that the connection between this node and the specified node has been initialized.
Action
None.

Message

Rejecting connection to member %n using TcpSocket{%s}
Parameters
%n - the node id that tries to connect to this node; %s - the full information for the TcpSocket that serves as a TcpRing connector to another node
Severity
4-Debug Level 4
Cause
Sometimes the TCP Ring daemons running on different nodes could attempt to join each other or the same node at the same time. In this case, the receiving node may determine that such a connection would be redundant and reject the incoming connection request. This message is logged by the rejecting node when this happens.
Action
None.
Message

Timeout while delivering a packet; requesting the departure confirmation for Member(%s1) by MemberSet(%s2)
Parameters
%s1 - the full Member information for a node that this node failed to communicate with; %s2 - the full information about the "witness" nodes that are asked to confirm the suspected member departure
Severity
2-Warning
Cause
Coherence uses UDP for all data communications (mostly peer-to-peer unicast), which by itself does not have any delivery guarantees. Those guarantees are built into the cluster management protocol used by Coherence (TCMP). The TCMP daemons are responsible for acknowledgment (ACK or NACK) of all incoming communications. If one or more packets are not acknowledged within the ACK interval ("ack-delay-milliseconds"), they are resent. This repeats until the packets are finally acknowledged or the timeout interval elapses ("timeout-milliseconds"). At this time, this message is logged and the "witness" protocol is engaged, asking other cluster nodes whether or not they experience similar communication delays with the non-responding node. The witness nodes are chosen based on their roles and location.
Action
Corrective action is not necessarily required, since the rest of the cluster presumably is continuing its operation and this node may recover and rejoin the cluster. On the other hand, it may warrant an investigation into root causes of the problem (especially if it is recurring with some frequency).

Message

This node appears to have become disconnected from the rest of the cluster containing %n nodes. All departure confirmation requests went unanswered. Stopping cluster service.
Parameters
%n - the number of other nodes in the cluster this node was a member of
Severity
1-Error
Cause
Sometimes a node that lives within a valid Java process stops communicating with other cluster nodes. (Possible reasons include: a) network failure; b) an extremely long GC pause; c) a swapped-out process.) In that case, other cluster nodes may choose to revoke the cluster membership for the paused node and completely shun any further communication attempts by that node, causing this message to be logged when the process attempts to resume cluster communications.
Action
Corrective action is not necessarily required, since the rest of the cluster presumably is continuing its operation and this node may recover and rejoin the cluster. On the other hand, it may warrant an investigation into root causes of the problem (especially if it is recurring with some frequency).
Message

A potential communication problem has been detected. A packet has failed to be delivered (or acknowledged) after %n1 seconds, although other packets were acknowledged by the same cluster member (Member(%s1)) to this member (Member(%s2)) as recently as %n2 seconds ago. Possible causes include network failure, poor thread scheduling (see FAQ if running on Windows), an extremely overloaded server, a server that is attempting to run its processes using swap space, and unreasonably lengthy GC times.
Parameters
%n1 - The number of seconds a packet has failed to be delivered or acknowledged; %s1 - the recipient of the packets indicated in the message; %s2 - the sender of the packets indicated in the message; %n2 - the number of seconds since a packet was delivered successfully between the two members indicated above
Severity
2-Warning
Cause
Possible causes are indicated in the text of the message.
Action
If this issue occurs frequently, the root cause should be investigated.

Message

Node %s1 is not allowed to create a new cluster; WKA list: [%s2]
Parameters
%s1 - Address of node attempting to join cluster; %s2 - List of WKA addresses
Severity
1-Error
Cause
The cluster is configured to use WKA, and there are no nodes present in the cluster that are in the WKA list.
Action
Ensure that at least one node in the WKA list exists in the cluster, or add this node's address to the WKA list.
Message

This member is configured with a compatible but different WKA list then the senior Member(%s). It is strongly recommended to use the same WKA list for all cluster members.
Parameters
%s - the senior node of the cluster
Severity
2-Warning
Cause
The WKA list on this node is different than the WKA list on the senior node.
Action
Ensure that every node in the cluster has the same WKA list.

Message

UnicastUdpSocket failed to set receive buffer size to %n1 packets (%n2 bytes); actual size is %n3 packets (%n4 bytes). Consult your OS documentation regarding increasing the maximum socket buffer size. Proceeding with the actual value may cause sub-optimal performance.
Parameters
%n1 - the number of packets that will fit in the buffer that Coherence attempted to allocate; %n2 - the size of the buffer Coherence attempted to allocate; %n3 - the number of packets that will fit in the actual allocated buffer size; %n4 - the actual size of the allocated buffer
Severity
2-Warning
Cause
The operating system limited the UDP socket receive buffer to a size smaller than the one Coherence requested, typically because the OS maximum allowed socket buffer size is set too low.
Action
Consult your OS documentation and increase the maximum socket buffer size, or reduce the configured Coherence buffer size. Proceeding with the smaller actual buffer may cause sub-optimal performance and packet loss under load.
