PIG script error when trying to store in AVRO format: Datum 2 is not in union ["null","string"]
I am trying to store data in AVRO format, but I can't work out why I get this error: Datum 2 is not in union ["null","string"]. What does it mean?
Parsing the XML:
REGISTER piggybank.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/avro.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/snappy-java-1.0.4.1.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader('data');
A = LOAD 'input' using XMLLoader as (x:chararray);
R = RANK A;
B = FOREACH R GENERATE
$0 as (id:chararray),
ToString(CurrentTime(),'yyyy-MM-dd HH:mm:ss.SSSSSS') as (CreatedTS:chararray),
CONCAT((chararray)XPath(x, 'data/key'), '_', ToString(CurrentTime(), 'yyMMddHmm'),'_',(chararray)$0) as (FileName:chararray),
XPath(x, 'data/title') as (title:chararray),
XPath(x, 'data/city') as (city:chararray),
XPath(x, 'data/country') as (country:chararray),
XPath(x, 'data/text') as (text:chararray),
XPath(x, 'data/empty_text') as (empty_text:chararray);
C = DISTINCT B;
DUMP C;
Output of DUMP C:
(2,2017-06-12 14:21:35.937000,f385a4_1706121421_2,Data text for two,תל אביב -יפו,IL,תל אביב -יפו, מחוז תל אביב,)
(3,2017-06-12 14:21:35.937000,657e21_1706121421_3,Data text three,תל אביב -יפו,IL,תל אביב -יפו,)
(4,2017-06-12 14:21:35.937000,5700da_1706121421_4,Data text four,Dublin,IE,Text data for example,)
(1,2017-06-12 14:21:35.937000,22bafc_1706121421_1,Data text one,Letterkenny,IE,Text data for example,)
Store:
STORE C INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
2017-06-12 14:57:19,857 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error: AttemptID:attempt_1496327466789_0452_r_000000_3 Info:Error: java.io.IOException: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 2 is not in union ["null","string"]
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:469)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 2 is not in union ["null","string"]
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:749)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:558)
at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
at org.apache.hadoop.mapreduce.lib.reduce.WrappedReducer$Context.write(WrappedReducer.java:105)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:467)
... 11 more
Caused by: java.lang.RuntimeException: Datum 2 is not in union ["null","string"]
at org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.resolveUnionSchema(PigAvroDatumWriter.java:128)
at org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.writeUnion(PigAvroDatumWriter.java:111)
at org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.write(PigAvroDatumWriter.java:82)
at org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.writeRecord(PigAvroDatumWriter.java:365)
at org.apache.avro.generic.GenericDatumWriter.writeWithoutConversion(GenericDatumWriter.java:105)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:73)
at org.apache.pig.piggybank.storage.avro.PigAvroDatumWriter.write(PigAvroDatumWriter.java:99)
at org.apache.avro.generic.GenericDatumWriter.write(GenericDatumWriter.java:60)
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:302)
... 19 more
I also tried defining the AVRO schema explicitly, but without success; I get the same error:
STORE C INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage(
'schema', '{
"name": "Myschema",
"type": "record",
"fields":
[
{"name": "id", "type": ["string", "null"]},
{"name": "FileName", "type": ["string", "null"]},
{"name": "title", "type": ["string", "null"]},
{"name": "city", "type": ["string", "null"]},
{"name": "country", "type": ["string", "null"]},
{"name": "text", "type": ["string", "null"]},
{"name": "empty_text", "type": ["string", "null"]}
]
}'
);
All fields are of type String (chararray), since I got them from XML.
DESCRIBE C:
DESCRIBE C;
C: {id: chararray,CreatedTS: chararray,FileName: chararray,title: chararray,city: chararray,country: chararray,text: chararray,empty_text: chararray}
I could not find any information online apart from this: http://www.gauravp.com/2014/06/pig-error-error-2997-encountered.html Can anyone please explain?
1 Answer
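Solved with an explicit cast in PIG: (chararray)$0 as (id:chararray). The root cause was that RANK returns the rank as a long. Declaring the field as id:chararray without the cast apparently changed only the declared schema, not the value, so AvroStorage received the long 2 for a field whose Avro type is the union ["null","string"]. So always check field types and return types, for example those of UDFs.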
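To make the fix concrete, here is a minimal sketch of the corrected script, trimmed to two output fields. It reuses the jar paths, the 'input' and 'output' locations, and the XPath expressions from the question, so all of those are assumptions about the asker's environment rather than anything verified here. The DESCRIBE R line is only there to check the type of the rank column before it is cast.

```pig
-- Jar paths are copied from the question; adjust them for your cluster.
REGISTER piggybank.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/avro.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/json-simple-1.1.jar;
REGISTER /opt/cloudera/parcels/CDH/lib/pig/lib/snappy-java-1.0.4.1.jar;

DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
DEFINE XMLLoader org.apache.pig.piggybank.storage.XMLLoader('data');

A = LOAD 'input' USING XMLLoader AS (x:chararray);

-- RANK prepends the rank to every tuple as a long.
R = RANK A;

-- Print the schema of R to confirm the type of the rank column.
DESCRIBE R;

B = FOREACH R GENERATE
    -- Explicit cast: converts the long rank value to chararray,
    -- so the stored datum matches the declared field type.
    (chararray)$0 AS id:chararray,
    XPath(x, 'data/title') AS title:chararray;

STORE B INTO 'output' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```

The only change relative to the question is the (chararray) cast on $0; the remaining fields already come out of XPath as chararray and need no cast.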