Batching
Pipelines automatically batch ingested records. Batching reduces the number of output files written to your destination, which can make the data more efficient to query.
Batch settings apply after the ingestion stage of a pipeline. As soon as a batch is filled, it is delivered downstream to any transformations you've configured, and then finally to your configured destination.
There are three ways to define how ingested data is batched:
- `batch-max-mb`: The maximum amount of data that will be batched, in megabytes. Default is 10 MB; maximum is 100 MB.
- `batch-max-rows`: The maximum number of rows or events in a batch before data is written. Default, and maximum, is 10,000 rows.
- `batch-max-seconds`: The maximum duration of a batch before data is written, in seconds. Default is 15 seconds; maximum is 300 seconds.
Pipelines batch definitions are hints. A pipeline will follow these hints closely, but batch sizes and timings will not be exact.
All three batch definitions work together. Whichever limit is reached first triggers the delivery of a batch.
For example, with `batch-max-mb` = 100 and `batch-max-seconds` = 300, if 100 MB of events are posted to the pipeline, the batch is delivered as soon as that size is reached. However, if it takes longer than 300 seconds for 100 MB of events to be posted, a batch of all the events posted during those 300 seconds is created and delivered.
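The interplay of the three limits can be sketched as a simple flush check. This is a hypothetical illustration, not Pipelines' actual implementation; the constants mirror the default values documented above:

```python
# Sketch of batch-flush logic: a batch is delivered as soon as ANY limit is hit.
# Hypothetical illustration only -- not the actual Pipelines implementation.

BATCH_MAX_MB = 10        # default batch-max-mb
BATCH_MAX_ROWS = 10_000  # default (and maximum) batch-max-rows
BATCH_MAX_SECONDS = 15   # default batch-max-seconds

def should_flush(batch_bytes: int, batch_rows: int, batch_age_seconds: float) -> bool:
    """Return True when any one of the three batch limits has been reached."""
    return (
        batch_bytes >= BATCH_MAX_MB * 1_000_000
        or batch_rows >= BATCH_MAX_ROWS
        or batch_age_seconds >= BATCH_MAX_SECONDS
    )

# A small, young batch stays open:
print(should_flush(batch_bytes=1_000, batch_rows=50, batch_age_seconds=2))      # False
# Reaching the row limit triggers delivery even though size and age are small:
print(should_flush(batch_bytes=1_000, batch_rows=10_000, batch_age_seconds=2))  # True
```

Because the check is an `or` over all three limits, whichever limit is reached first wins, matching the behavior described above.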
To update the batch settings for an existing pipeline using Wrangler, run the following command in a terminal:
npx wrangler pipelines update [PIPELINE-NAME] --batch-max-mb 100 --batch-max-rows 10000 --batch-max-seconds 300

You can configure the following batch-level settings to adjust how Pipelines create a batch:
| Setting | Default | Minimum | Maximum |
|---|---|---|---|
| Maximum Batch Size `batch-max-mb` | 10 MB | 0.001 MB | 100 MB |
| Maximum Batch Timeout `batch-max-seconds` | 15 seconds | 0 seconds | 300 seconds |
| Maximum Batch Rows `batch-max-rows` | 10,000 rows | 1 row | 10,000 rows |
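As a rough illustration of the bounds in the table, a client-side check before invoking Wrangler might look like the following. This is a hypothetical helper, not part of Wrangler, which performs its own validation:

```python
# Validate batch settings against the documented bounds before updating a pipeline.
# Hypothetical helper for illustration; Wrangler itself enforces these limits.

# (minimum, maximum) per setting, taken from the table above.
LIMITS = {
    "batch-max-mb": (0.001, 100),
    "batch-max-seconds": (0, 300),
    "batch-max-rows": (1, 10_000),
}

def validate_batch_settings(settings: dict) -> None:
    """Raise ValueError if any setting falls outside its documented range."""
    for name, value in settings.items():
        low, high = LIMITS[name]
        if not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside the range [{low}, {high}]")

# The defaults and maxima from the table all pass:
validate_batch_settings(
    {"batch-max-mb": 100, "batch-max-rows": 10_000, "batch-max-seconds": 300}
)
```

Catching out-of-range values before calling `npx wrangler pipelines update` avoids a failed round trip to the API.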