Saturday, August 30, 2014

Bootable Windows 7 x64 Install Flash Drive

Create a Bootable Windows 7 x64 Install Flash Drive from 32-bit Windows


Creating a bootable Windows 7 x64 flash drive from within a 32-bit install of Windows is not as straightforward as it may seem. I recently had to go through this process myself, so I’ll document the steps below.
Things you’ll need
  1. Windows 7 x64 disc image
  2. Windows 7 USB/DVD Download Tool
  3. 32-bit bootsect.exe
Create the installer
  1. Install the Windows 7 USB/DVD Download Tool.
  2. Extract the 32-bit bootsect.exe file to the directory that the Windows 7 USB/DVD Download Tool was installed to. This is usually something like "C:\Users\username\AppData\Local\Apps\Windows 7 USB DVD Download Tool".
  3. Run the Windows 7 USB/DVD Download Tool and select your Windows 7 disc image. Follow the remaining steps in this tool and your image should be created successfully!
If you’ve followed these steps and your flash installer was created successfully, then your next step is, of course, to install Windows 7! Don’t forget to change your boot options to boot from USB!

Wednesday, August 27, 2014

Command Line Arguments to Pig Scripts

Parameter Placeholder

First, we need to create a placeholder for the parameter that needs to be replaced inside the Pig script. Let’s say you have the following line in your Pig script, where you are loading an input file.
INPUT = LOAD '/data/input/20130326';
In the above statement, if you want to replace the date part dynamically, you have to create a placeholder for it.
INPUT = LOAD '/data/input/$date';

Individual Parameters

To pass individual parameters to the Pig script, we can use the -param option while invoking the script. The syntax would be:
pig -param date=20130326 -f myfile.pig
If you want to pass two parameters, you can add one more -param option.
pig -param date=20130326 -param date2=20130426 -f myfile.pig
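For reference, the corresponding placeholders inside the script would look something like this (a minimal sketch; the relation names INPUT1 and INPUT2 and the paths are just illustrative):
INPUT1 = LOAD '/data/input/$date';
INPUT2 = LOAD '/data/input/$date2';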

Param File

If there are a lot of parameters that need to be passed, or if we need a more flexible way to do it, we can place all of them in a single file and pass the file name using the -param_file option.
The param file uses a simple ini-style format, where every line contains a param name and its value. We can specify comments using the # character.
date=20130326
date2=20130426
We can pass the param file using the following syntax:
pig -param_file myfile.ini -f myfile.pig
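To show the comment syntax mentioned above, a slightly larger param file might look like this (the file name and values are just examples):
# parameters for the March run
date=20130326
# second date used by the same script
date2=20130426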

Default Statement

We can also assign a default value to a parameter inside the Pig script using the %default statement, as shown below:
%default date '20130326'
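Putting the pieces together, here is a minimal sketch of a script that declares a default but still accepts an override (the relation name INPUT and the path are just illustrative):
%default date '20130326'
INPUT = LOAD '/data/input/$date';
Running pig -f myfile.pig uses 20130326, while pig -param date=20130426 -f myfile.pig overrides the default.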

Processing Order

One good thing about parameter substitution in Pig is that you can pass in values for the same parameter using multiple options simultaneously. Pig will pick them up in the following order (a short worked example follows this list).
  • The %default statement takes the lowest precedence.
  • The values passed using -param_file take the next precedence.
    • If there are multiple entries for the same param in a file, the one that comes later takes precedence.
    • If there are multiple param files, the files specified later take precedence.
  • The values passed using the -param option take the highest precedence.
    • If multiple values are specified for the same param, the ones specified later take precedence.
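To illustrate this order, suppose the same parameter is set in all three places (all values below are just examples):
inside myfile.pig:    %default date '20130101'
inside myfile.ini:    date=20130326
on the command line:  pig -param_file myfile.ini -param date=20130426 -f myfile.pig
Here $date resolves to 20130426: -param outranks -param_file, which in turn outranks the %default statement.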

Debugging

Sometimes the precedence might be a little confusing, especially if you have multiple files and multiple params. Pig also provides a -debug option to help debug this kind of scenario. If you invoke Pig with this option, it will generate a file with the extension .substituted in the current directory, with the placeholders replaced by the resolved values.
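As a sketch (using the script and param file names from the earlier examples), an invocation like
pig -debug -param_file myfile.ini -param date=20130426 -f myfile.pig
should leave a .substituted copy of the script in the current directory, which you can open to see exactly which values were picked.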


I specify a default value using the %default statement and then pass the actual values using the -param_file option. If I am in a hurry and just want to test something locally, I use the -param option, but generally I try to put the values in a separate ini file so that I can check in the options as well.

Monday, August 25, 2014

Microsoft Hadoop HDInsight (Hadoop on Windows Azure)

Windows Azure HDInsight Service, formerly known as Hadoop on Windows Azure,  is now available inside the Windows Azure Preview portal.  Hadoop-based big data tools are what I call the WMD(P), or Weapons of Mass Data Processing. (You heard it here first!)  This is a very exciting development, and I would like to take a moment to recognize the great work our HDInsight team has done to pull this off.

Signing up

1. To sign up, log into your Windows Azure account at http://www.windowsazure.com.  At the bottom of the portal, click on New (+).
2. Then click on DATA SERVICES followed by HDInsight. Use the preview program link to navigate to the sign-up page.


At the Preview features page, also accessible from https://account.windowsazure.com/PreviewFeatures/, click on try it now next to Azure HDInsight Preview. Please be warned that this might take up to a few days.


Getting Started and more Learning Content

Overview presentation

Or download it from a direct link: http://bit.ly/YPstrp
Hands-on Lab
New Channel 9 Video Series

Direct links to the Windows Azure Portal Documentation Section

Quick Start

Visit these articles first to get started using the HDInsight Service.
Tutorial: Getting Started with the Windows Azure HDInsight Service
Tutorial: Using MapReduce with HDInsight
Tutorial: Using Hive with HDInsight
Tutorial: Using Pig with HDInsight

Explore

Guidance: Introduction to the Windows Azure HDInsight Service
A conceptual overview of the components of HDInsight.
Tutorial: Get Started with the HDInsight Service
Learn the fundamentals of working with the HDInsight service.

Plan

Guidance: What version of Hadoop is in Windows Azure HDInsight?
Learn what Hadoop components and versions are included in HDInsight.
Guidance: HDInsight Interactive JavaScript and Hive Consoles
HDInsight comes with interactive consoles for JavaScript and Hive that you can use as an alternative to remoting into the head node of a Hadoop cluster. Learn about how you can use the consoles to enter expressions, evaluate them, and then query and display the results of a MapReduce job immediately.

Upload

How to: Upload data to HDInsight
Learn how to upload and access data in HDInsight using Azure Storage Explorer, the interactive console, the Hadoop command line, or Sqoop.

Analyze

Tutorial: Connect Excel to Windows Azure HDInsight via HiveODBC
One key feature of Microsoft’s Big Data Solution is solid integration of Apache Hadoop with Microsoft Business Intelligence (BI) components. A good example of this is the ability for Excel to connect to the Hive data warehouse framework in the Hadoop cluster. This topic walks you through using Excel via the Hive ODBC driver.
Tutorial: Simple recommendation engine using Apache Mahout   
Apache Mahout is a machine learning library built for use in scalable machine learning applications. Recommender engines are some of the most immediately recognizable machine learning applications in use today. In this tutorial you download the Million Song Dataset from its site and use it to create song recommendations for users based on their past listening habits.
Tutorial: Analyzing Twitter Data with Hive   
In this tutorial you will query, explore, and analyze data from Twitter using the Apache Hadoop-based HDInsight Service for Windows Azure and a complex Hive example.
Tutorial: Using HDInsight to process Blob Storage data and write the results to a SQL Database
This tutorial will show you how to use the HDInsight service to process data stored in Windows Azure Blob Storage and move the results to a Windows Azure SQL Database.

Manage

How to: Administer HDInsight
Learn how to create an HDInsight cluster and use the administrative tools available through the Windows Azure Management Portal.
How to: Monitor HDInsight
Learn how to monitor an HDInsight cluster and view Hadoop job history through the Windows Azure Management Portal.
How to: Deploy an HDInsight Cluster Programmatically
Learn how to use the Windows Azure service management REST API to create, list, and delete HDInsight clusters programmatically.
How to: Execute Remote Jobs on Your HDInsight Cluster Programmatically
Learn how to use the WebHCat REST API to provide metadata management and remote job submission to your Hadoop cluster.

Saturday, August 23, 2014

Apache Pig : UDF

Apache Pig : Writing a User Defined Function (UDF)


Preface:

In this post we will write a basic/demo custom function for Apache Pig, called a UDF (User Defined Function).
A Pig Java UDF extends the functionality of EvalFunc. This abstract class has an abstract method, exec, which the user needs to implement in a concrete class with the appropriate functionality.

Problem Statement:

Let’s write a simple Java UDF that takes a Tuple of two DataBags as input and checks whether the second databag (set) is a subset of the first databag (set).
For example, assume you are given a tuple of two databags, where each DataBag contains numbers as its elements (tuples).
Input:
Databag1 : {(10),(4),(21),(9),(50)}
Databag2 : {(9),(4),(50)}
Output:
True
The function should return true because Databag2 is a subset of Databag1.

From an implementation point of view

As we are extending the abstract class EvalFunc, we will implement the exec function. In it we’ll write the logic to determine whether the given set is a subset of the other. We will also override the outputSchema function to specify the output schema (boolean: true or false).
import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.logicalLayer.schema.Schema.FieldSchema;

/**
 * Finds whether the given setB is a subset of setA.
 *
 * input:
 *   setA : {(10),(4),(21),(9),(50)}
 *   setB : {(9),(4),(50)}
 *
 * output:
 *   true
 */
public class IsSubSet extends EvalFunc<Boolean> {

    @Override
    public Schema outputSchema(Schema input) {
        // The UDF expects exactly two fields, both of type bag.
        if (input.size() != 2) {
            throw new IllegalArgumentException("input should contain two elements!");
        }
        List<FieldSchema> fields = input.getFields();
        for (FieldSchema f : fields) {
            if (f.type != DataType.BAG) {
                throw new IllegalArgumentException("input fields should be bags!");
            }
        }
        // The output is a single boolean field named "isSubset".
        return new Schema(new FieldSchema("isSubset", DataType.BOOLEAN));
    }

    // Copy the tuples of a bag into a HashSet for fast membership checks.
    private Set<Tuple> populateSet(DataBag dataBag) {
        Set<Tuple> set = new HashSet<Tuple>();
        Iterator<Tuple> iter = dataBag.iterator();
        while (iter.hasNext()) {
            set.add(iter.next());
        }
        return set;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        Set<Tuple> setA = populateSet((DataBag) input.get(0));
        Set<Tuple> setB = populateSet((DataBag) input.get(1));
        // setB is a subset of setA if setA contains every element of setB.
        return setA.containsAll(setB) ? Boolean.TRUE : Boolean.FALSE;
    }
}

A Quick Test

Let’s test our UDF to check whether a given set is a subset of another set.
-- Register the jar that contains the UDF.
register '/home/hadoop/udf.jar';
-- Define the function for use.
define isSubset IsSubSet();
-- Let's assume we have a dataset like the following:
dump dataset;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)})
--({(50),(78),(45),(7),(4)},{(7),(45),(50)})
--({(1),(2),(3),(4),(5)},{(4),(3),(50)})
-- Let's check the subset function.
result = foreach dataset generate $0, $1, isSubset($0, $1);
dump result;
--({(10),(4),(21),(9),(50)},{(9),(4),(50)},true)
--({(50),(78),(45),(7),(4)},{(7),(45),(50)},false)
--({(1),(2),(3),(4),(5)},{(4),(3),(50)},false)








