Writer Order and FME Performance
From Fmepedia
Overview
Writer order is interesting because it involves a hidden performance improvement that can produce strange-looking results, but that can equally be turned to the user's advantage. The following explanation might be a bit generalized - or even technically incorrect - in some places, but the overall gist describes a feature of FME well worth knowing.
How FME Multi-Writing Works
In the beginning FME could only write to one output dataset in any process, but was soon updated to permit multiple output datasets in multiple formats.
Because there are no restrictions on how connections are made within a workspace, data in a workspace containing multiple streams can arrive at any writer in any order. But FME can't open multiple writers at the same time, so it deals with this by storing (caching) all data as it arrives at the output stage. Once the workspace transformers are complete, and all the data has arrived at the writers, FME opens the writers one by one and outputs the data.
So writer 1 is opened, its data is written, and writer 1 is closed; then writer 2 is opened, its data is written, and writer 2 is closed; then writer 3 is opened, and so on.
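The cache-then-write sequence described above can be sketched as a toy model. This is only a minimal Python illustration of the idea, not FME's actual implementation; the function name `run_all_cached`, the writer names, and the event format are all hypothetical.

```python
from collections import defaultdict


def run_all_cached(features, writer_order):
    """Toy model: cache every feature, then open one writer at a time.

    features: iterable of (writer_name, feature) pairs in arrival order.
    writer_order: list of writer names in the order they are opened.
    Returns a list of events describing what happened.
    """
    cache = defaultdict(list)
    for writer, feature in features:
        cache[writer].append(feature)  # nothing is written at this stage

    events = []
    for writer in writer_order:
        events.append(("open", writer))
        for feature in cache[writer]:
            events.append(("write", writer, feature))
        events.append(("close", writer))
    return events


# Features arrive interleaved, but are written writer by writer.
arrivals = [("writer2", "b1"), ("writer1", "a1"), ("writer2", "b2")]
for event in run_all_cached(arrivals, ["writer1", "writer2"]):
    print(event)
```

Note that in this model no writer is opened until every feature has arrived, which is exactly the cost the later improvement removes for the first writer.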
How FME Multi-Writing Was Improved
This situation works well enough, but then someone realized that a performance improvement could be made by opening writer 1 as soon as the workspace is started. Features could then be written to dataset 1 as soon as they arrived at the output stage, and would not need to be cached.
Features for datasets 2, 3, 4, etc. would still need to be cached, because FME can't have multiple writers open, but at least the first writer will be quicker, and in workspaces where there is only one writer, the whole process will be quicker.
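A toy model of the improved behavior, with the same caveat that it only illustrates the idea described here and is not FME's real code. The function `run_first_direct` and its return values are hypothetical.

```python
def run_first_direct(features, writer_order):
    """Toy model: the first writer is open from the start, the rest are cached.

    features: iterable of (writer_name, feature) pairs in arrival order.
    writer_order: list of writer names; only the first is written directly.
    Returns (written_directly, cached) as feature counts.
    """
    first = writer_order[0]
    written_directly = 0
    cache = {name: [] for name in writer_order[1:]}

    for writer, feature in features:
        if writer == first:
            written_directly += 1          # goes straight to the open writer
        else:
            cache[writer].append(feature)  # held until that writer is opened

    cached = sum(len(held) for held in cache.values())
    return written_directly, cached


arrivals = [("writer1", i) for i in range(5)] + [("writer2", i) for i in range(3)]
print(run_first_direct(arrivals, ["writer1", "writer2"]))  # (5, 3)
print(run_first_direct(arrivals, ["writer2", "writer1"]))  # (3, 5)
```

The same features produce a different amount of caching depending only on which writer comes first in the list.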
How this Improvement Causes Strange Results
Not caching dataset 1 is all very well, but it can lead to results that appear strange to even the most expert user.
Example 1
Check out this example workspace.

Above: This is all very simple. 100 features are generated and dispatched to one of the two writers (depending on where the link is connected).

Above: With the link connected to the first writer the process takes 18 seconds.

Above: With the link connected to the second writer the process takes 35 seconds.
How can this be? In each case the same data is written to the same format, but the second run is much slower. Of course, this is because features for the first writer are not cached but written directly. Although it looks bad, the difference in time is not a problem with the second writer, but rather the performance improvement brought about by the bright idea of not having to cache dataset 1.
NB: ignore the fact that the log file 1 says "dataset 2" and vice versa. I got my feature type names mixed up and forgot to redo the screenshots!
How this Improvement Can be Used to Your Advantage
So, writing to dataset 1 is faster - which is great in itself - but if you also happened to know that you can alter the order of writers within a workspace, then you might realize that you now have control over which dataset gets this performance boost!
So which dataset will you promote to this position? The one that has the largest amount of data, because it will give the greatest reduction in the number of features cached.
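As a rough sketch of that rule of thumb, the following Python snippet moves the writer with the most expected features to the top of the list and counts how many features would still be cached. The writer names and feature counts are hypothetical, and the model assumes that only the first writer avoids caching, as described above.

```python
def promote_largest(writer_order, feature_counts):
    """Return a new writer order with the largest dataset moved to the top."""
    largest = max(writer_order, key=lambda name: feature_counts.get(name, 0))
    return [largest] + [name for name in writer_order if name != largest]


def features_cached(writer_order, feature_counts):
    """Count features bound for every writer except the first one."""
    return sum(feature_counts.get(name, 0) for name in writer_order[1:])


# Hypothetical counts: a small log dataset listed ahead of the main dataset.
counts = {"error_log": 2, "main_output": 50_000, "summary": 10}
order = ["error_log", "main_output", "summary"]

print(features_cached(order, counts))   # 50010
better = promote_largest(order, counts)
print(better)                           # ['main_output', 'error_log', 'summary']
print(features_cached(better, counts))  # 12
```

In practice the feature counts are rarely known exactly in advance, so a rough estimate of which dataset is largest is usually all that is needed.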
Example 2
Check out this example workspace.

Above: Another simple workspace. One Creator generates 100,000 features and sends them to a writer. A second Creator generates just 1 feature and sends that to another writer.

Above: At first the writer receiving just one feature is top of the dataset list

Above: With this configuration the overall process takes 22 seconds
However, writer order can be controlled by right-clicking a dataset in the navigator pane and using the option to "move up in list".

Above: Continuing the example, the FME user reads this page and, realizing the importance, promotes the larger dataset...

Above: ...so now the writer receiving 100,000 features is top of the dataset list

Above: With this configuration the overall process takes just 12 seconds!
So that performance gain is the difference between having to cache 100,000 features and having to cache 1 feature. You can use your imagination to think what difference this might make when you are processing millions of features. Also, the more complex the data, the bigger the improvement will be. A set of multiple geometries with many attributes would take more time and memory to handle than the single point features in the above example.
In the past I've had project workspaces where I write all my data to one SDE dataset and have a null dataset set up to log one or two invalid features. It makes me wonder if I was getting inferior performance because the null dataset was the first writer!
All this happens even if the first dataset is disabled; i.e. the disabled dataset is still the un-cached one! Many users also add datasets, then delete the feature types but leave the dataset itself intact. This too could be reducing efficiency. So if you have writers that you aren't using any more, the safest course is to delete them, or at least demote them to the bottom of the dataset list.
Why this Information might be Redundant in the Future
This is speculation on my part, but I know our developers are working on making FME multi-threaded. If this occurs then I'm thinking that each writer will be spun off as a different thread and so caching would not be required at all - hence all this would be irrelevant.
