I've been using Google's MapReduce for App Engine for some time. It uses the Pipeline API to connect its map, shuffle, and reduce phases. That gave me an idea.
I grabbed that Pipeline API and implemented a denormalizing pipeline. It would receive data from each table of a relational database and denormalize it into App Engine's Datastore (which is non-relational). What the pipeline helped with was waiting for the tables that hadn't arrived yet, so it could perform the joins needed to complete the denormalization. It worked, but Datastore writes soared, quickly pushing my app past its daily budget.
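The core of what the pipeline was computing is just a buffered join: hold rows from one table until the matching rows from the other table show up, then emit flat records. Here's a minimal sketch of that join logic in plain Python (the table and field names are hypothetical; the real version ran inside Pipeline API generators):

```python
def denormalize(users, orders):
    """Join normalized `users` and `orders` rows into flat records,
    each one ready to be written to the Datastore as a single entity."""
    users_by_id = {u["id"]: u for u in users}
    denormalized = []
    for order in orders:
        user = users_by_id[order["user_id"]]
        denormalized.append({
            "order_id": order["id"],
            "amount": order["amount"],
            # user fields are copied into the order record,
            # so no join is needed at read time
            "user_name": user["name"],
            "user_email": user["email"],
        })
    return denormalized

users = [{"id": 1, "name": "Ana", "email": "ana@example.com"}]
orders = [{"id": 10, "user_id": 1, "amount": 25}]
flat = denormalize(users, orders)
```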
I decided to run some tests. Yes, I should've run them before coding the whole thing.
Checking the RPCs behind that pricey request, here's what I saw:
98 writes on a single put. So what's in there? I clicked on evaluate and found a db.put(entities_to_put), and in entities_to_put there were _SlotRecord entities, _PipelineRecord entities, and _BarrierRecord entities: the whole pipeline pack.
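For scale: under the Datastore billing model of the time, a new-entity put was charged at 2 write ops plus 2 per indexed property value (plus 1 per composite index entry), so a batch put of bookkeeping entities adds up fast. A rough illustration, with made-up entity counts (one combination that happens to land on 98, not taken from my actual trace):

```python
def put_cost(indexed_property_values, composite_index_entries=0):
    """Write ops charged for putting one new entity under App Engine's
    legacy billing: 2 + 2 per indexed property value + 1 per
    composite index entry."""
    return 2 + 2 * indexed_property_values + composite_index_entries

# Hypothetical batch: 7 pipeline bookkeeping entities,
# each with 6 indexed property values.
batch = [put_cost(6) for _ in range(7)]
total = sum(batch)  # 7 entities * 14 ops each
```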
This is what unleashed the write frenzy. I had set up a generator pipeline that would start other generators, and they would all feed data back up the pipeline, each stage writing its own bookkeeping records along the way.
Well, the Pipeline API does an amazing job for MapReduce, but it's clearly not useful at all for what I had in mind here. My plan ended up being overkill. I was a little misled by the really simple samples in the getting-started guide. They just show how simple it is to set a pipeline up, and as you can see in the gists above, they truly are simple. Denormalizing at the source is the way to go in this case: data comes in already denormalized, Datastore saves it, done.
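With denormalization pushed to the source, the import side collapses to a single pass: each incoming row already carries everything its entity needs, so there's nothing to wait for and nothing to join. A hypothetical sketch (field names invented for illustration):

```python
def import_rows(rows, put):
    """Each row arrives already denormalized at the source, so importing
    is one pass: build the entity dict and write it."""
    for row in rows:
        # one put per record, and zero pipeline bookkeeping entities
        put({
            "order_id": row["order_id"],
            "user_name": row["user_name"],
            "amount": row["amount"],
        })

written = []
import_rows(
    [{"order_id": 10, "user_name": "Ana", "amount": 25}],
    written.append,
)
```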