Friday, October 08, 2010

Make Pig 0.7.0 work with Hadoop 0.21?

I was trying to learn how to use Pig.

My Hadoop version was 0.21.0, and my Pig version was 0.7.0.

Following the instructions in Hadoop: The Definitive Guide, I set the environment variables as follows:

/usr/local/pig/bin$ printenv | grep PIG
PIG_HOME=/usr/local/pig
PIG_INSTALL=/usr/local/pig
PIG_HADOOP_VERSION=21
PIG_CLASSPATH=/usr/local/hadoop/conf
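
For reference, here is a minimal sketch of how these variables could be set in a shell session. The values are the ones shown in the printenv output above. Where the lines live (for example a login profile such as ~/.bashrc) is an assumption on my part, not something the book prescribes here.

```shell
# Values copied from the printenv output above.
# Putting these lines in a profile file such as ~/.bashrc is an assumption.
export PIG_HOME=/usr/local/pig
export PIG_INSTALL=/usr/local/pig
export PIG_HADOOP_VERSION=21                 # the value tried in this post
export PIG_CLASSPATH=/usr/local/hadoop/conf  # directory holding the Hadoop config files

# Show what ended up in the environment.
printenv | grep PIG
```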

Then, when I tried to run pig, I got the following error:
/usr/local/pig/bin$ pig
10/10/08 22:15:16 INFO pig.Main: Logging error messages to: /opt/pig-0.7.0/bin/pig_1286601316333.log
2010-10-08 22:15:16,978 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://mini:54310
2010-10-08 22:15:17,632 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage
Details at logfile: /opt/pig-0.7.0/bin/pig_1286601316333.log

Looking at the list of files under pig/lib:
/usr/local/pig/lib$ ls -lt
total 34240
-rw-r--r--@ 1 hadoop admin 807 May 5 11:19 hadoop-LICENSE.txt
-rw-r--r--@ 1 hadoop admin 8006352 May 5 11:19 hadoop20.jar
drwxr-xr-x@ 4 hadoop admin 136 May 5 11:19 jdiff
-rw-r--r--@ 1 hadoop admin 4784831 May 5 11:19 hadoop18.jar
-rw-r--r--@ 1 hadoop admin 171596 May 5 11:19 automaton.jar
-rw-r--r--@ 1 hadoop admin 1916683 May 5 11:19 hbase-0.20.0-test.jar
-rw-r--r--@ 1 hadoop admin 1530035 May 5 11:19 hbase-0.20.0.jar
-rw-r--r--@ 1 hadoop admin 1109768 May 5 11:19 zookeeper-hbase-1329.jar
/usr/local/pig/lib$
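
The listing above can also be checked with a small script. This is only a sketch: the helper name has_bundled_hadoop_jar is made up for illustration, and the idea that a bundled jar named hadoopNN.jar corresponds to PIG_HADOOP_VERSION=NN is my assumption based on the file names, not something I verified in the Pig scripts.

```shell
# Hypothetical helper: succeed if LIB_DIR contains hadoop<VERSION>.jar.
# Assumes the bundled jars follow the hadoopNN.jar naming seen in the listing.
has_bundled_hadoop_jar() {
    lib_dir=$1
    version=$2
    [ -f "$lib_dir/hadoop$version.jar" ]
}

# /usr/local/pig is the install location used in this post.
PIG_LIB="${PIG_HOME:-/usr/local/pig}/lib"

for v in 18 20 21; do
    if has_bundled_hadoop_jar "$PIG_LIB" "$v"; then
        echo "hadoop$v.jar: present"
    else
        echo "hadoop$v.jar: missing"
    fi
done
```

Against the lib directory listed above, this would report hadoop21.jar as missing.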

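There was no hadoop21.jar next to hadoop18.jar and hadoop20.jar, so I wasn't sure whether Hadoop 0.21 was supported.

So I edited bin/pig, which is a bash script. I commented out the original exec line and added one that puts the Hadoop 0.21.0 jars at the front of the classpath: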

# HINADA
# exec "$JAVA" $JAVA_HEAP_MAX $PIG_OPTS -classpath "$CLASSPATH" $CLASS ${remaining}
exec "$JAVA" $JAVA_HEAP_MAX $PIG_OPTS -classpath "/usr/local/hadoop/hadoop-common-0.21.0.jar:/usr/local/hadoop/hadoop-hdfs-0.21.0.jar:/usr/local/hadoop/hadoop-mapred-0.21.0.jar:$CLASSPATH" $CLASS ${remaining}
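
The hardcoded classpath above is hard to read, so here is a sketch of how the same prefix could be built from variables. The helper join_classpath and the variable names HADOOP_HOME, HADOOP_VER, and HADOOP_JARS are hypothetical names introduced for illustration; the jar paths are the ones from the exec line above.

```shell
# Hypothetical helper: join all arguments with ':' to form a Java classpath.
join_classpath() {
    local IFS=':'
    echo "$*"
}

HADOOP_HOME=/usr/local/hadoop   # install location used in this post
HADOOP_VER=0.21.0

HADOOP_JARS=$(join_classpath \
    "$HADOOP_HOME/hadoop-common-$HADOOP_VER.jar" \
    "$HADOOP_HOME/hadoop-hdfs-$HADOOP_VER.jar" \
    "$HADOOP_HOME/hadoop-mapred-$HADOOP_VER.jar")

echo "$HADOOP_JARS"

# In bin/pig the modified exec line would then read:
# exec "$JAVA" $JAVA_HEAP_MAX $PIG_OPTS -classpath "$HADOOP_JARS:$CLASSPATH" $CLASS ${remaining}
```

This only changes how the classpath string is assembled. It does not make Pig 0.7.0 any more compatible with Hadoop 0.21.0.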

With this change, when I type pig, I now get:

/usr/local/pig/bin$ pig
10/10/08 22:18:03 INFO pig.Main: Logging error messages to: /opt/pig-0.7.0/bin/pig_1286601483112.log
2010-10-08 22:18:03,570 [main] WARN org.apache.hadoop.conf.Configuration - user.name is deprecated. Instead, use mapreduce.job.user.name
2010-10-08 22:18:03,573 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: local
2010-10-08 22:18:03,716 [main] INFO org.apache.hadoop.security.Groups - Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000
2010-10-08 22:18:03,833 [main] WARN org.apache.hadoop.conf.Configuration - mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
2010-10-08 22:18:04,170 [main] WARN org.apache.hadoop.conf.Configuration - user.name is deprecated. Instead, use mapreduce.job.user.name
grunt>

Now I can see the grunt prompt!

However, my happiness was short-lived...

When I ran a DUMP statement, I got the following error in the log:

ERROR 2998: Unhandled internal error. org.apache.hadoop.mapred.jobcontrol.JobControl.addJob(Lorg/apache/hadoop/mapred/jobcontrol/Job;)Ljava/lang/String;

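The message is a JVM method signature, which suggests that the JobControl.addJob method Pig 0.7.0 expects is not available in that form in Hadoop 0.21.0.

I then installed Hadoop 0.20 and ran Pig against it, and the problem did not occur, so I need to stick with Hadoop 0.20 for now.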
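
To avoid hitting this again by accident, a small guard could check the Hadoop version before starting Pig. This is a sketch under assumptions: the function names are made up, it relies on the first line of `hadoop version` looking like "Hadoop 0.20.2", and treating only the 0.20 line as supported reflects nothing more than what worked for me here.

```shell
# Hypothetical helper: extract "MAJOR.MINOR" from a banner line such as "Hadoop 0.20.2".
hadoop_major_minor() {
    echo "$1" | sed -n 's/^Hadoop \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical guard: succeed only for the 0.20 line, which worked with Pig 0.7.0 here.
is_supported_hadoop() {
    [ "$(hadoop_major_minor "$1")" = "0.20" ]
}

# Only run the check if a hadoop command is actually on the PATH.
if command -v hadoop >/dev/null 2>&1; then
    banner=$(hadoop version 2>/dev/null | head -n 1)
    if is_supported_hadoop "$banner"; then
        echo "OK: $banner"
    else
        echo "Warning: Pig 0.7.0 failed on DUMP with Hadoop 0.21.0; found: $banner" >&2
    fi
fi
```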

I posted my experiences at my website below:
https://sites.google.com/site/winstoninada/playing-with-pig

Wednesday, October 06, 2010