Monday, July 28, 2014

Denormalization using the Google Pipeline API

I've been using Google's MapReduce for App Engine for some time. It uses the Pipeline API to connect its map, shuffle and reduce phases. It gave me ideas.

I grabbed the Pipeline API and implemented a denormalizing pipeline. It would receive data from each table of a relational database and denormalize it into App Engine's Datastore (non-relational). The pipeline's job was to wait for any missing table so it could do the joins needed to complete the denormalization. It worked, but Datastore writes soared, quickly pushing my app to its daily budget.
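To make the wait-then-join idea concrete, here is a minimal plain-Python sketch. It is not the Pipeline API and not my actual code: the `JoinBuffer` class, the `customers` and `orders` tables and their columns are all hypothetical names picked for illustration. It only shows the shape of the problem, holding rows until the last table shows up and then joining them.

```python
# Hypothetical table names and columns; the post does not name its tables.
REQUIRED_TABLES = ("customers", "orders")


class JoinBuffer:
    """Holds each table's rows until every required table has arrived."""

    def __init__(self, required_tables):
        self.required = set(required_tables)
        self.tables = {}

    def receive(self, table, rows):
        """Store one table; return joined records once nothing is missing."""
        self.tables[table] = list(rows)
        if self.required - set(self.tables):
            return None  # at least one table is still missing
        return self._join()

    def _join(self):
        # Index customers by primary key, then attach each order to its customer.
        customers = {row["id"]: row for row in self.tables["customers"]}
        return [
            {
                "order_id": order["id"],
                "total": order["total"],
                "customer_name": customers[order["customer_id"]]["name"],
            }
            for order in self.tables["orders"]
            if order["customer_id"] in customers
        ]


buffer = JoinBuffer(REQUIRED_TABLES)
first = buffer.receive("orders", [{"id": 10, "customer_id": 1, "total": 25}])
second = buffer.receive("customers", [{"id": 1, "name": "Ada"}])
print(first)   # None, because customers had not arrived yet
print(second)  # the joined, denormalized records
```

In the real thing, the waiting part was what I handed over to the pipeline, and that is where the extra Datastore writes came from.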

I decided to run some tests. Yes, I should've run them before coding the whole thing.



The first call to the run method writes 32 times to the Datastore. Summing up all the writes gives a total of 104. Each call to the run method that has a child writes around 30 times to the Datastore, and the last one, without a child, writes just 8 times.



Now the writes get down to business: 108 writes on the generator pipeline, 8 on the other calls to run without a child. Everything besides the generator adds up to 162, for a total of 270. Ouch.
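For a quick sanity check on those numbers, here is the arithmetic as a few lines of Python. The figures are the ones reported above; the variable names are mine, and treating the two tests as directly comparable is an assumption.

```python
# Write counts reported in the post for the two test runs.
simple_total = 104         # first test, all calls to run
generator_call = 108       # second test, the generator pipeline alone
rest_of_second_test = 162  # second test, everything else

generator_total = generator_call + rest_of_second_test
growth = generator_total / simple_total
generator_share = generator_call / generator_total

print(generator_total)            # 270
print(round(growth, 2))           # 2.6
print(round(generator_share, 2))  # 0.4
```

So the generator version costs roughly 2.6 times the writes of the simple one, and the generator call alone accounts for about 40% of them.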

Checking the RPC for that pricey call, here's what I see:


98 writes in a single put. Now what's in there? I clicked on evaluate and found a db.put(entities_to_put), and entities_to_put holds _SlotRecord, _PipelineRecord and _BarrierRecord entities: the whole pipeline pack.

This is what set off the writing frenzy. I had set up a generator that would start other generators, and then they'd all feed data back up the pipeline.

Well. The Pipeline API does an amazing job for MapReduce, but it's clearly not useful for what I had in mind here. My plan ended up being overkill. I was a little deceived by the really simple samples in the getting started guide. They just show how simple it is to set a pipeline up and, as you can see in the gists above, it truly is. Denormalizing at the source is the way to go in this case: data comes in denormalized and Datastore saves it, done.
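Here is a rough sketch of what "denormalizing at the source" could look like, assuming the source is a SQL database that can do the join itself. I'm using an in-memory SQLite database from the standard library as a stand-in, and the `customers` and `orders` tables, their columns and the `denormalized_orders` function are hypothetical names for illustration only.

```python
import sqlite3

# In-memory stand-in for the relational source; schema and data are made up.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25), (11, 2, 40), (12, 1, 5);
""")


def denormalized_orders(connection):
    """Let the database do the join and return one flat dict per order."""
    query = """
        SELECT o.id AS order_id, o.total, c.name AS customer_name
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        ORDER BY o.id
    """
    return [dict(row) for row in connection.execute(query)]


records = denormalized_orders(conn)
for record in records:
    print(record)  # each record is ready to be saved as a single entity
```

Each record arrives already joined, so the receiving side only has to save it, with no slots, barriers or pipeline records to keep track of.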