Apache Pig : UDF
Apache Pig : Writing a User Defined Function (UDF)
Apache Pig : Writing a User Defined Function (UDF)
Preface:
In this post we will write a basic/demo custom function for Apache Pig, called as UDF (User Defined Function).
Pig’s Java UDF extends functionalities of EvalFunc. This abstract class have an abstract method “exec” which user needs to implement in concrete class with appropriate functionality.
Pig’s Java UDF extends functionalities of EvalFunc. This abstract class have an abstract method “exec” which user needs to implement in concrete class with appropriate functionality.
Problem Statement:
Lets write a simple Java UDF which takes input as Tuple of two DataBag and check whether second databag(set) is subset of first databag(set).
For example, Assume you have been given tuple of two databags. Each DataBag contains elements(tuples) as number.
For example, Assume you have been given tuple of two databags. Each DataBag contains elements(tuples) as number.
Input:
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True
Then function should return true as Databag2 is subset of Databag1.
From implemetation point of view
As we are extending abstract class EvalFucn, we will be implementing exec function. In this function we’ll write logic to find is given set is subset of other or not. We will also override function outputSchema to specify output schema ( boolean : true or false ).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
| import java.io.IOException; import java.util.HashSet; import java.util.Iterator; import java.util.List; import java.util.Set; import org.apache.pig.EvalFunc; import org.apache.pig.data.DataBag; import org.apache.pig.data.DataType; import org.apache.pig.data.Tuple; import org.apache.pig.impl.logicalLayer.schema.Schema; import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema; /** * Find the whether given SetB is subset of SetA. * * input: * * * * output: * * |
*
*
*/
public
class
IsSubSet
extends
EvalFunc {
@Override
public
Schema outputSchema(Schema input) {
if
(input.size()!=
2
){
throw
new
IllegalArgumentException(
"input should contains two elements!"
);
}
List fields = input.getFields();
for
(FieldSchema f : fields){
if
(f.type != DataType.BAG){
throw
new
IllegalArgumentException(
"input fields should be bag!"
);
}
}
return
new
Schema(
new
FieldSchema(
"isSubset"
,DataType.BOOLEAN));
}
private
Set populateSet(DataBag dataBag){
HashSet set =
new
HashSet();
Iterator iter = dataBag.iterator();
while
(iter.hasNext()){
set.add(iter.next());
}
return
set;
}
@Override
public
Boolean exec(Tuple input)
throws
IOException {
Set setA = populateSet((DataBag) input.get(
0
));
Set setB = populateSet((DataBag) input.get(
1
));
return
setA.containsAll(setB) ? Boolean.TRUE : Boolean.FALSE;
}
}
A Quick Test
Lets test our UDF to find whether given set is subset of other set or not.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
| -- Register jar which contains UDF. register '/home/hadoop/udf.jar' ; -- Define function for use. define isSubset IsSubSet(); -- lets assume we have dataset as following : dump datset; --({(10),(4),(21),(9),(50)},{(9),(4),(50)}) --({(50),(78),(45),(7),(4)},{(7),(45),(50)}) --({(1),(2),(3),(4),(5)},{(4),(3),(50)}) -- lets check subset function result = foreach dataset generate $0,$1,isSubset($0,$1); dump result; --({(10),(4),(21),(9),(50)},{(9),(4),(50)},true) --({(50),(78),(45),(7),(4)},{(7),(45),(50)},false) --({(1),(2),(3),(4),(5)},{(4),(3),(50)},false) |
0 comments:
Post a Comment