However, there are several things I hate about Pig:
- Lazy evaluation. Actually, the fact that there is only one mode: the lazy evaluation mode... :(
- Errors that make no sense. "Can't cast bytearray into bytearray". Why would I want to turn an apple into an apple?! Another one of my favorites is something like "Internal error occurred", followed by a bunch of exceptions that point to the Pig source code. No information in the error message, and absolutely no indication of where in the Pig script the error occurred.
- Kinda slow. Well, this isn't a big deal any more, and most of the time I prefer the fact that I get my sh*t done way faster and in a cleaner way than in Java or streaming in C++/Python. Still, it's kinda annoying when you end up with a workflow where you could easily save a couple of hours or a couple of nodes if it were faster =)
- Adding to the previous point: **corner cases**... *sigh* I can't emphasize this enough. In general, Pig is not that much slower than writing Java MapReduce for Hadoop. But in some cases, you end up with a job that should take 5 minutes taking 105. I have a job that just finished that demonstrates exactly this "feature": 100 nodes, 105 minutes for some terabytes of data processing...
- It's not complex enough. I guess it's because of the whole "Pig philosophy", but it's still not complex enough. I find myself crawling through UDFs or writing my own to do something that is really super simple in Java.
I know it looks like pure Pig bashing, but it's more of a love-hate relationship. I hate Pig for all of these, but once I write a script in a couple of hours that I know is super easy to maintain and document, I feel so much happiness that I just can't resist coming back to her... :)
And of course, with that comes the fact that when you hand this code over to an heir, it's really easy to explain to them how the code works. Try explaining complex MapReduce code to someone who has no clue about MR, heh...!