Temporal Platform knowledge base | Temporal Documentation

Troubleshoot the "failed reaching server" error

June 5, 2023

The message Failed reaching server: last connection error can often result from an expired TLS certificate or during the Server startup process, in which the Client requests reach the Server before the roles are fully initialized.

This troubleshooting guide shows you how to do the following:

Verify the certification expiration date
Renew the certification
Update the server configuration

Troubleshoot deadline-exceeded error

May 22, 2023

All requests made to the Temporal Cluster by the Client or Worker are gRPC requests. Sometimes, when these frontend requests can't be completed, you'll see this particular error message: Context: deadline exceeded. Network interruptions, timeouts, server overload, and Query errors are some of the causes of this error.

The following sections discuss the nature of this error and how to troubleshoot it.

Temporal Platform limits sheet

April 28, 2023

Running into limits can cause unexpected failures. Knowing the limits of Temporal can prevent that.

This page details many of the errors and warnings coded into the Temporal Platform. Errors are hard limits that fail when reached. Warnings are soft limits that produce a warning log on the server side.

Legacy OSS Temporal Server self-hosted production deployment guide

April 25, 2023

note

The information in this page is being dispersed into Knowledge base articles, Cluster concept guide, and the Cluster deployment guide.

Temporal Server

Temporal Server is a Go application which you can import or run as a binary (we offer builds with every release).

If you are running only the Go binary, Go is not required.

But if you are building Temporal or running it from source, Go v1.17+ is required.

While Temporal can be run as a single Go binary, we recommend that production deployments of Temporal Server should deploy each of the 4 internal services separately (if you are using Kubernetes, one service per pod) so they can be scaled independently in the future.

See Clusters concept guide for more info.

In practice, this means you will run each container with a flag specifying each service, e.g.

docker run
    # persistence/schema setup flags omitted
    -e SERVICES=history \                      -- Spinup one or more of: history, matching, worker, frontend
    -e LOG_LEVEL=debug,info \                           -- Logging level
    -e DYNAMIC_CONFIG_FILE_PATH=config/foo.yaml         -- Dynamic config file to be watched
    temporalio/server:<tag>

See the Docker source file for more details.

Each release also ships a Server with Auto Setup Docker image that includes an auto-setup.sh script we recommend using for initial schema setup of each supported database. You should familiarize yourself with what auto-setup does, as you will likely be replacing every part of the script to customize for your own infrastructure and tooling choices.

Though neither are blessed for production use, you can consult our Docker-Compose repo or Helm Charts for more hints on configuration options.

The information in this page is being dispersed into Knowledge base articles, Cluster concept guide, and the Cluster deployment guide.

:::

Minimum Requirements

The minimum Temporal Server dependency is a database. We support Cassandra, MySQL, or PostgreSQL, with SQLite on the way.
Further dependencies are only needed to support optional features. For example, enhanced Workflow search can be achieved using Elasticsearch.
Monitoring and observability are available with Prometheus and Grafana.
Each language SDK also has minimum version requirements. See the versions and dependencies page for precise versions we support together with these features.

Kubernetes is not required for Temporal, but it is a popular deployment platform anyway. We do maintain a Helm chart you can use as a reference, but you are responsible for customizing it to your needs.

Configuration

At minimum, the development.yaml file needs to have the global and persistence parameters defined.

The Server configuration reference has a more complete list of possible parameters.

Before you deploy: Reminder on shard count

A huge part of production deploy is understanding current and future scale - the number of shards can't be changed after the cluster is in use so this decision needs to be upfront. Shard count determines scaling to improve concurrency if you start getting lots of lock contention. The default numHistoryShards is 4; deployments at scale can go up to 500-2000 shards. Please consult our configuration docs and check with us for advice if you are worried about scaling.

Scaling and Metrics

The requirements of your Temporal system will vary widely based on your intended production workload. You will want to run your own proof of concept tests and watch for key metrics to understand the system health and scaling needs.

Configure your metrics subsystem. Temporal supports three metrics providers out of the box via Uber's Tally interface: StatsD, Prometheus, and M3. Tally offers extensible custom metrics reporting, which we expose via temporal.WithCustomMetricsReporter. OpenTelemetry support is planned in the future.
Set up monitoring. You can use these Grafana dashboards as a starting point. The single most important metric to track is schedule_to_start_latency - if you get a spike in workload and don't have enough workers, your tasks will get backlogged. We strongly recommend setting alerts for this metric. This is usually emitted in client SDKs as both temporal_activity_schedule_to_start_latency_* and temporal_workflow_task_schedule_to_start_latency_* variants - see the Prometheus GO SDK example and the Go SDK source and there are plans to add it on the Server.
- Set up alerts for Workflow Task failures.
- Also set up monitoring/alerting for all Temporal Workers for standard metrics like CPU/Memory utilization.
Load testing. You can use the Maru benchmarking tool (author's guide here), see how we ourselves stress test Temporal, or write your own.

All metrics emitted by the server are listed in Temporal's source. There are also equivalent metrics that you can configure from the client side. At a high level, you will want to track these 3 categories of metrics:

Service metrics: For each request made by the service handler we emit service_requests, service_errors, and service_latency metrics with type, operation, and namespace tags. This gives you basic visibility into service usage and allows you to look at request rates across services, namespaces and even operations.
Persistence metrics: The Server emits persistence_requests, persistence_errors and persistence_latency metrics for each persistence operation. These metrics include the operation tag such that you can get the request rates, error rates or latencies per operation. These are super useful in identifying issues caused by the database.
Workflow Execution stats: The Server also emits counters for when Workflow Executions are complete. These are useful in getting overall stats about Workflow Execution completions. Use workflow_success, workflow_failed, workflow_timeout, workflow_terminate and workflow_cancel counters for each type of Workflow Execution completion. These include the namespace tag.

Please request any additional information in our forum. Key discussions are here:

Checklist for Scaling Temporal

Temporal is highly scalable due to its event sourced design. We have load tested up to 200 million concurrent Workflow Executions. Every shard is low contention by design, and it is very difficult to oversubscribe to a Task Queue in the same cluster. With that said, here are some guidelines to some common bottlenecks:

Database. The vast majority of the time the database will be the bottleneck. We highly recommend setting alerts on schedule_to_start_latency to look out for this. Also check if your database connection is getting saturated.
Internal services. The next layer will be scaling the 4 internal services of Temporal (Frontend, Matching, History, and Worker). Monitor each accordingly. The Frontend Service is more CPU bound, whereas the History and Matching Services require more memory. If you need more instances of each service, spin them up separately with different command line arguments. You can learn more cross referencing our Helm chart with our Server Configuration reference.
See Platform limits for other limits you will want to keep in mind when doing system design, including event history length.

Please see the dedicated docs on Tuning and Scaling Workers.

FAQs

FAQ: Autoscaling Workers based on Task Queue load

Temporal does not yet support returning the number of tasks in a task queue. The main technical hurdle is that each task can have its own ScheduleToStart timeout, so just counting how many tasks were added and consumed is not enough.

This is why we recommend tracking schedule_to_start_latency for determining if the task queue has a backlog (aka your Workflow and Activity Workers are under-provisioned for a given Task Queue). We do plan to add features that give more visibility into the task queue state in the future.

FAQ: High Availability cluster configuration

You can set up a high availability deployment by running more than one instance of the server. Temporal also handles membership and routing. You can find more details in the clusterMetadata section of the Server Configuration reference.

clusterMetadata:
  enableGlobalNamespace: false
  failoverVersionIncrement: 10
  masterClusterName: "active"
  currentClusterName: "active"
  clusterInformation:
    active:
      enabled: true
      initialFailoverVersion: 0
      rpcAddress: "127.0.0.1:7233"

FAQ: Multiple deployments on a single cluster

You may sometimes want to have multiple parallel deployments on the same cluster, for example:

when you want to split Temporal deployments based on namespaces, e.g. staging/dev/uat, or for different teams who need to share common infrastructure.
when you need a new deployment to change numHistoryShards.

You can skip the following procedure if your server is running v1.19 or later. The v1.19 release ensures that the membership from different clusters does not combine.

We recommend not doing this if you can avoid it. If you need to do it anyway, double-check the following:

Have a separate persistence (database) for each deployment
Cluster membership ports should be different for each deployment (they can be set through environment variables). For example:
- Temporal1 services can have 7233 for frontend, 7234 for history, 7235 for matching
- Temporal2 services can have 8233 for frontend, 8234 for history, 8235 for matching
There is no need to change gRPC ports.

More details about the reason here.

Securing Temporal

Please see our dedicated docs on Temporal Server Security.

Debugging Temporal

Debugging Temporal Server Configs

Recommended configuration debugging techniques for production Temporal Server setups:

Containers (to be completed)
Storage (to be completed)
Networking
- Temporal cluster unable to establish ring membership, causing an infinite crash loop: Use tcurl to audit it

Debugging Workflows

We recommend using Temporal Web to debug your Workflow Executions in development and production.

Tracing Workflows

Temporal Web's tracing capabilities mainly track activity execution within a Temporal context. If you need custom tracing specific for your usecase, you should make use of context propagation to add tracing logic accordingly.

Example: Tracing Temporal Workflows with DataDog

Further things to consider

danger

This document is still being written, and we would welcome your questions and contributions.

Please search for these topics in our forum or ask on Slack.

Temporal Antipatterns

Please request elaboration on any of these.

Trying to implement a queue in a Workflow (because people hear we replace queues)
Serializing massive amounts of state into and out of the workflow.
Treating everything as rigid/linear sequence of steps instead of dynamic logic
Implementing a DSL which is actually just a generic schema based language
Polling in activities instead of using signals
Blocking on incredibly long RPC requests and not using heartbeats
Failing/retrying workflows without a specific business reason

Temporal Best practices

Please request elaboration on any of these.

Mapping things to entities instead of traditional service design
Testing: unit, integration
Retries: figuring out right values for timeouts
Versioning
The Workflow is Temporal's fundamental unit of scalability - break things into workflows to scale, don't try to stuff everything in one workflow!

External Runbooks

Third party content that may help:

Recommended Setup for Running Temporal with Cassandra on Production (Temporal Forums)
How To Deploy Temporal to Azure Container Instances
How To Deploy Temporal to Azure Kubernetes Service (AKS)
AWS ECS runbook (we are seeking external contributions, please let us know if you'd like to work on this)
AWS EKS runbook (we are seeking external contributions, please let us know if you'd like to work on this)

Temporal Cloud Namespace naming considerations

April 20, 2023

This article provides general guidance for organizing Namespaces across use cases, services, applications, or domains. Temporal Cloud provides Namespace–as-a-service, so the Namespace is the endpoint. Customers should consider not only a Namespace naming convention but also how to group or isolate workloads using the Namespace as a boundary.

An opinionated guide to productionizing Workflows

March 31, 2023

You've learned about Temporal, checked out our samples, written a few Workflows, and now you're ready to productionize. In this article, we outline some techniques you can employ to ensure that your Workflows are ready for the future.

All the ways to run a Temporal Cluster

March 16, 2023

There are many ways to run a Temporal Cluster on your own. However, the right way for you depends entirely on your use case and where you plan to run it. This article aims to maintain a comprehensive list of all the ways we know of.

Python sandbox environment

January 8, 2023

Python sandbox

The Temporal Python SDK enables you to run Workflow code in a sandbox environment to help prevent non-determinism errors in your application. The Temporal Workflow Sandbox for Python is not completely isolated, and some libraries can internally mutate state, which can result in breaking determinism.

How to explain Temporal

December 12, 2022

There are many ways to introduce and teach Temporal based on your background. Temporal doesn't have a monopoly on explaining Temporal.

Set up Grafana with Temporal Cloud observability to view metrics

December 2, 2022

Temporal Cloud and SDKs emit metrics that can be used to monitor performance and troubleshoot errors.

Temporal Cloud emits metrics through a Prometheus HTTP API endpoint which can be directly used as a Prometheus data source in Grafana or to query and export Cloud metrics to any observability platform.

Temporal Server​

Minimum Requirements​

Configuration​

Scaling and Metrics​

Checklist for Scaling Temporal​

FAQs​

FAQ: Autoscaling Workers based on Task Queue load​

FAQ: High Availability cluster configuration​

FAQ: Multiple deployments on a single cluster​

Securing Temporal​

Debugging Temporal​

Debugging Temporal Server Configs​

Debugging Workflows​

Tracing Workflows​

Further things to consider​

Temporal Antipatterns​

Temporal Best practices​

External Runbooks​