multiple outputs hadoop

Possible Duplicate: MultipleOutputFormat in hadoop How can I change the code of the WordCount.java program in the examples so that the word counts for each file are put in separate files? That is, instead of having a single word count over all files in that default part-00000 file. Also, the output file always has the name part-00000 or some other name along those lines; can I choose the output filename I want for this file, and if so, how? I imagine I have to configure this in the
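For reference, a minimal sketch of one common approach with the new-API MultipleOutputs (class and field names here are illustrative, not the asker's code): the mapper tags each word with its source filename, and the reducer routes counts to a base path derived from that filename instead of the single part-00000 file. The driver is the stock WordCount driver and is omitted.

```java
// Sketch only: per-file word counts with MultipleOutputs (new mapreduce API).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class PerFileWordCount {

  // Emit "filename<TAB>word" as the key so the reducer knows the source file.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text composite = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          composite.set(fileName + "\t" + word);
          context.write(composite, ONE);
        }
      }
    }
  }

  // Write each count under a base path named after the source file, so the job
  // output directory contains one set of files per input file.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      String[] parts = key.toString().split("\t", 2);
      String fileName = parts[0];
      String word = parts[1];
      // Third argument is a base output path relative to the job output directory;
      // this is also how you control the output filename.
      mos.write(new Text(word), new IntWritable(sum), fileName + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      mos.close();
    }
  }
}
```

If the empty default part-r-* files are unwanted, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) in the driver suppresses them.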

Hadoop Writing to HBase in MapReduce using MultipleOutputs

I currently have a MapReduce job that uses MultipleOutputs to send data to several HDFS locations. After that completes, I am using HBase client calls (outside of MR) to add some of the same elements to a few HBase tables. It would be nice to add the HBase outputs as just additional MultipleOutputs, using TableOutputFormat. In that way, I would distribute my HBase processing. Problem is, I cannot get this to work. Has anyone ever used TableOutputFormat in MultipleOutputs...? With multiple

Hadoop Interesting projects based on Distributed/Operating Systems

I would like to know of some interesting challenges based on distributed systems that could be solved within the time frame of a quarter (my university follows the quarter system!). I am hoping to work on a single project that would satisfy both an Operating Systems course and a Distributed Systems course, so that I would have plenty of time to work on it (as I have taken both courses!). I am looking for a strong programming component. Could anybody point me in the right direction? I know Hadoop/

Splittable Bz2 input in Hadoop 1.0.0

I have a cluster that uses Hadoop 1.0.0 and I would like to run an MR job that processes huge bz2 files. In version 0.21.0 the bz2 codec supported splitting of input files; however, I could not find this functionality in 1.0.0. Is there any equivalent way of splitting bz2 input in 1.0.0, or should I manually apply the patch from 0.21.0?

Writing Data to Cassandra Hadoop Mapper(no reduce)

I get the following exception while trying to write to Cassandra directly from the mapper, skipping the reduce task. . . . ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, outputPath); job.setMapperClass(MapperToCassandra.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(Text.class); LOG.info("Writing output to Cassandra"); //job.setReducerClass(ReducerToCassandra.class); job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

Hadoop Example and more explanation about LoadFunc

Where can I find more information/examples about LoadFunc? Apart from http://web.archive.org/web/20130701024312/http://ofps.oreilly.com/titles/9781449302641/load_and_store_funcs.html I don't see any examples that use the new LoadFunc APIs. Can anyone please let me know where I can find some examples of writing a load UDF?
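As a starting point, here is a rough skeleton of the newer LoadFunc API (the class name is made up; it simply delegates to TextInputFormat and returns one single-field tuple per line):

```java
// Minimal LoadFunc skeleton for the newer Pig load/store API.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SimpleLineLoader extends LoadFunc {
  private RecordReader<?, ?> reader;
  private final TupleFactory tupleFactory = TupleFactory.getInstance();

  @Override
  public void setLocation(String location, Job job) throws IOException {
    // Tell the underlying InputFormat where to read from.
    FileInputFormat.setInputPaths(job, location);
  }

  @Override
  public InputFormat getInputFormat() throws IOException {
    // Delegate splitting and record reading to TextInputFormat.
    return new TextInputFormat();
  }

  @Override
  public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
    this.reader = reader;
  }

  @Override
  public Tuple getNext() throws IOException {
    try {
      if (!reader.nextKeyValue()) {
        return null;                     // end of input for this split
      }
      Text line = (Text) reader.getCurrentValue();
      Tuple t = tupleFactory.newTuple(1);
      t.set(0, line.toString());         // one chararray field per line
      return t;
    } catch (InterruptedException e) {
      throw new IOException(e);
    }
  }
}
```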

Apache Pig 0.10 with CDH3u0 Hadoop failed to work as normal user

I have used Pig, but I am new to Hadoop/Pig installation. I have CDH3u0 Hadoop installed, running Pig 0.8. I downloaded Pig 0.10 and installed it in a separate directory. I am able to start Pig as the root user, but fail to start Pig as a normal user with the following error: Exception in thread "main" java.io.IOException: Permission denied at java.io.UnixFileSystem.createFileExclusively(Native Method) at java.io.File.checkAndCreate(File.java:1704) at java.io.File.createTempFile(File.java:1792) at org.apac

Read Hadoop SequenceFile: weird hex number stream

I am trying to convert a piece of a Hadoop SequenceFile into plain text with the following code: Configuration config = new Configuration(); Path path = new Path( inputPath ); SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config); WritableComparable key = (WritableComparable) reader.getKeyClass().newInstance(); Writable value = (Writable) reader.getValueClass().newInstance(); File output = new File(outputPath); if(!output.exists())
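For comparison, a self-contained sketch along the same lines, using ReflectionUtils instead of newInstance() since some Writables need a Configuration to instantiate. Note that the "weird hex number stream" usually comes from the value type's toString(); for example, BytesWritable prints its bytes as hex pairs.

```java
// Rough sketch: dump a SequenceFile as text with the classic Reader API (Hadoop 1.x style).
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileDumper {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]);
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
    try {
      // ReflectionUtils handles Writables that need a Configuration to instantiate.
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        // toString() is what turns the raw bytes into readable text; custom Writables
        // without a sensible toString() will still look like a hex stream.
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}
```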

Hadoop How to configure HBase using Cygwin?

I have configured my Hadoop in Cygwin and now I am trying to configure HBase. I have made some changes in hbase-env.sh: export JAVA_HOME=C:\\java\\jre export HBASE_CLASSPATH=C:\\cygwin\\usr\\local\\hbase\\lib\\zookeeper-3.3.2 export HBASE_REGIONSERVERS=C:\\cygwin\\usr\\local\\hbase\\conf\\regionservers export HBASE_MANAGES_ZK=true and some changes in hbase-site.xml: <property> <name>hbase.rootdir</name> <value>hdfs://localhost:9100/hbase</value> </pr

Hadoop Why does group in Pig have odd ordering behavior

In Hadoop, if you want to group and order something and you write Java, the group keys are also sorted in lexicographical order by default, all within one MR job, so you are spared an extra ordering job. But now, using Pig, I find a quirky thing. My input (test.txt) is: a ab abc b c My script is: A=load 'test.txt' as c1:chararray; B=group A by c1; dump B; The output is: (a) (b) (c) (ab) (abc) Why does the group key order depend on

Hadoop Access an HDFS file from a UDF

I'd like to access a file from my UDF call. This is my script: files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file); buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)}; The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems that the file is not discovered, maybe because I'm trying to access the file on HDFS. How can I access a file in HDFS from my
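One hedged way to do this (the UDF name and the HDFS path below are made-up placeholders, not the asker's pigbuzz.Buzz): open the file through the Hadoop FileSystem API inside an EvalFunc, using the job configuration that UDFContext carries to the backend. The distributed cache is usually preferred for small side files, but a direct read works as a sketch:

```java
// Sketch: reading a side file from HDFS inside a Pig EvalFunc.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.util.UDFContext;

public class HdfsLookup extends EvalFunc<String> {
  private String cached;   // read the side file once per task, not per tuple

  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0) {
      return null;
    }
    if (cached == null) {
      // UDFContext carries the job configuration to the backend, so FileSystem.get
      // resolves hdfs:// paths the same way the rest of the job does.
      Configuration conf = UDFContext.getUDFContext().getJobConf();
      FileSystem fs = FileSystem.get(conf);
      // Hypothetical path; replace with the real location of your side data.
      try (BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/user/data/stopwords.txt"))))) {
        cached = in.readLine();   // whatever parsing your side data needs
      }
    }
    return cached + ":" + input.get(0);
  }
}
```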

Increasing size of files created by Flume pipe into Hadoop

I've got a configuration file for Flume that looks like this: TwitterAgent.sources = Twitter TwitterAgent.channels = MemChannel TwitterAgent.sinks = HDFS TwitterAgent.sources.Twitter.type = TwitterAgent.sources.Twitter.channels = MemChannel TwitterAgent.sources.Twitter.consumerKey = TwitterAgent.sources.Twitter.consumerSecret = TwitterAgent.sources.Twitter.accessToken = TwitterAgent.sources.Twitter.accessTokenSecret = TwitterAgent.sources.Twitter.keywords = TwitterAgent.sinks.HDFS.ch

Hadoop Mapper task completion event

We have a map-reduce task set up on HBase. I have a requirement where I need to know when a mapper instantiated by the framework completes its task. Is there any event that I can listen for? Thanks.

Hadoop HDFS Safe mode issue

I am facing an issue with HDFS. The error is given below: Problem accessing /nn_browsedfscontent.jsp. Reason: Cannot issue delegation token. Name node is in safe mode. The reported blocks 428 needs additional 2 blocks to reach the threshold 0.9990 of total blocks 430. Safe mode will be turned off automatically. I even tried to leave safe mode using the command, but I am getting a superuser privilege issue, even when I try as the root user. I am using CDH 4. Can anyone let me know why t

Hadoop Checking whether a bag is null inside FOREACH in Pig

I am joining 3 tables and inside a FOREACH I need to check whether the ReadStagingData bag is null. Below is the code: ReadStagingData = Load 'Staging_data.csv' Using PigStorage(',') As (PL_Posn_id:int,Brok_org_dly:double,Brok_org_ptd:double); ReadPriorData = Load 'ptd.csv' Using PigStorage(',') As (PL_Posn_id:int,Brok_org_ptd:double); ReadPriorFunctional = Load 'Functional.csv' Using PigStorage(',') AS (PL_Posn_id:int,Brok_fun_ptd:double,Brok_fun_ltd:double); JoinDS1 = JOIN ReadPriorData BY

How to validate data migrated from an RDB to Hadoop HDFS

Please let me know which tool is preferable for validating data during a migration from an RDB to Hadoop HDFS. My requirement is to validate the data being migrated from Oracle to Hadoop HDFS; the output is a flat file that gets stored in Hadoop HDFS.

Hadoop Error: Jobflow entered COMPLETED while waiting to ssh

I just started to practice with AWS EMR. I have a sample word-count application set up, run, and completed from the web interface. Following the guideline here, I have set up the command-line interface, so when I run the command: ./elastic-mapreduce --list I receive j-27PI699U14QHH COMPLETED ec2-54-200-169-112.us-west-2.compute.amazonaws.com Word count COMPLETED Setup hadoop debugging COMPLETED Word count Now, I want to see the log files. I run the comm

Job stuck at map 0% reduce 0% in Hadoop 2

When I type jps, I get: 4961 RunJar 3931 DataNode 2543 org.eclipse.equinox.launcher_1.3.0.v20130327-1440.jar 4183 SecondaryNameNode 4569 NodeManager 4450 ResourceManager 6191 Jps 5043 MRAppMaster 3791 NameNode Log of the datanode: 2013-12-16 21:53:27,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-692584010-127.0.0.1-1386505361103:blk_1073741910_1086 src: /127.0.0.1:51842 dest: /127.0.0.1:50010 2013-12-16 21:53:27,577 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace

Hadoop HDP Ambari installation in Ubuntu12.04

What are the steps to install HDP 2.0 through Ambari? I have tried the steps described in the Hortonworks documentation, but the installation is not successful.

How to use neo4j as input to hadoop?

I have a large neo4j database. I need to check for multiple patterns existing across the graph, which I was thinking would be easily done in hadoop. However, I'm not sure of the best way to feed tuples from neo4j into hadoop. Any suggestions?

hadoop 2.2.0 Exception from container-launch at java.lang.Thread.run(Thread.java:744)

I'm new to Hadoop and trying to set up Hadoop 2.2.0 on a server with 32 cores, 64 GB of memory, and 8 disks. While tweaking 'yarn-site.xml', I found that when I add <property> <name>yarn.nodemanager.resource.memory-mb</name> <value>40960</value> </property> to 'yarn-site.xml' and run: hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 10 5 I get the error: 14/01/23 19:42:55 INFO mapreduce.Job: Task Id : attempt

Hadoop How to clear Zookeeper corruption

I am looking for some help in figuring out how to clear what looks like a corruption in Zookeeper. Our setup was running fine with Solr Cloud. At some point the root partition on one of the cluster nodes became full and the system went down. After we brought it back up, Solr was not responding and could not start. It looks like there is a corruption in the zookeeper data. Anytime a client tries to access the node /overseer/queue it will kill the connection with an error: ..."KeeperExcep

Hadoop Why Is a Block in HDFS So Large?

Can somebody explain this calculation and give a lucid explanation? A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, to make the seek time 1% of the transfer time, we need to make the block size around 100 MB. The default is actually 64 MB, although many HDFS installations use 128 MB blocks. This figure will continue to be revised upward as transfer speeds grow with new generations of disk drives.
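Spelling out the arithmetic behind that quote: with a 10 ms seek and a 100 MB/s transfer rate, keeping seek overhead at roughly 1% means each seek must be followed by about one second of sequential transfer, and one second of transfer at 100 MB/s is 100 MB.

```latex
\[
t_{\text{transfer}} = \frac{t_{\text{seek}}}{0.01} = \frac{10\ \text{ms}}{0.01} = 1\ \text{s},
\qquad
\text{block size} \approx 100\ \tfrac{\text{MB}}{\text{s}} \times 1\ \text{s} = 100\ \text{MB}
\]
```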

Hadoop How do you use ORCFile Input/Output format in MapReduce?

I need to implement a custom I/O format based on ORCFile I/O format. How do I go about it? Specifically I would need a way to include the ORCFile library in my source code (which is a custom Pig implementation) and use the ORCFile Output format to write data, and later use the ORCFile Input format to read back the data.

Hadoop How to configure Hue-2.5.0 and HIve-0.11.0

For the past 2 days I have been working on setting up Hue, but no luck. The versions I tried with Hive 0.11.0: 3.5, 3.0, 2.4, 2.1, 2.3, 2.5. After much googling I came to know that 3.5 and 3.0 (the documentation says 0.11) are compatible with Hive 0.12 or 0.13, but as mine is 0.11 I faced issues like: required client protocol, no database found, list index error. Finally I was able to set up Hue 2.5.0 and it indeed connects with HiveServer2. My properties in hue.ini: beeswax_server_host=localhost ser

Exception in thread "main" java.lang.ClassCastException: WordCount cannot be cast to org.apache.hadoop.util.Tool

I'm trying to execute the new-API MapReduce WordCount example. This is my program, and I'm using hadoop-core-1.0.4.jar as a plugin. import java.io.IOException; import java.util.Iterator; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Redu
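The usual cause of that ClassCastException is a driver class launched through ToolRunner (or a run configuration that expects a Tool) without actually implementing org.apache.hadoop.util.Tool. A hedged sketch of a driver shape that satisfies the contract; the mapper and reducer bodies follow the stock example and are included only so the class compiles on its own:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String tok : value.toString().split("\\s+")) {
        if (!tok.isEmpty()) {
          word.set(tok);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "word count");  // Job.getInstance(getConf()) on newer APIs
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner can only cast the class to Tool if it actually implements Tool;
    // the ClassCastException in the question is what you see when it does not.
    System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
  }
}
```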

Hadoop Cannot run Hive metastore with remote mode

I added Hive to my Apache Hadoop distributed cluster. Instead of creating the metastore in a local directory, I would like to create the metastore in HDFS. However, my settings in hive-site.xml don't seem to work. I got the error below, which suggests that Hive still tried to run the metastore in local mode. java.sql.SQLException: Directory /home/zz/metastore_db cannot be created. Can anyone tell me what has gone wrong with my settings? Thanks a lot! Below is my hive-site.xml content: <proper

Unable to see TaskTracker and JobTracker after Hadoop 2.5.1 single-node installation

I am new to Hadoop 2.5.1. As I had already installed Hadoop 1.0.4 previously, I thought the installation process would be the same, so I followed this tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ Everything was fine; I have even given these settings in core-site.xml: <name>fs.default.name</name> <value>hdfs://localhost:54310</value> But I have seen this value given as 9000 on several sites, and also changes in yarn.xml.

Hadoop spilling records from circular buffer (mapper)

As far as I understand, these mappers have a circular buffer that they keep writing to until a certain threshold is reached, at which point they spill data to disk. This process may involve running a combiner and compressing data while writing to the disk. So I was wondering whether we could instead configure it to compress the data in memory and store it in memory, so that by the end of the mapper it has not overflowed its circular buffer and hence only needs to spill once to the disk. Is th

Facing issue while configuring Apache Hive 1.0 with Apache Hadoop 1.2.1

I am facing the permission issue mentioned below while running Hive in a terminal, after configuring it in the .bashrc file. hadoop@hadoop:~$ hive Logging initialized using configuration in jar:file:/home/hadoop/Downloads/apache-hive-1.0.0/lib/hive-common-1.0.0.jar!/hive-log4j.properties Exception in thread "main" java.lang.RuntimeException: java.lang.RuntimeException: The root scratch dir: /tmp/hive on HDFS should be writable. Current permissions are: rwx--x--x at org.apache.hadoop.hive.ql.session.SessionState

Hadoop Search for a particular text in a string - Hive

I need to get the required page groups. If I search for page groups starting with 'google' using a Hive query, I need to get the data of the first 3 rows: /google/gmail/inbox /google/drive/map /google/apps In this way I need to get the data based on the page group. I searched for the string using the LIKE function: select * from table where field like '%/google/%';

Hadoop Display all the fields associated with the record using Impala

Suppose I have a student table with some fields in Impala. Imagine there is a field called total_marks and I need to find the student details with the maximum mark in each branch. My table is like this :- In this table I have to get the details of the student with maximum marks from each department. My query would be like this :- select id,max(total_marks) from student_details group by department; But using this query I can get only the id and total_marks. Note that there can be students with the same name

Hadoop HIVE UDF error while creating

While creating a UDF for Hive, I am getting the error below: org.apache.ambari.view.hive.client.HiveErrorStatusException: H110 Unable to submit statement. Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.FunctionTask [ERROR_STATUS] I am working on the Hortonworks platform. I tried to create a simple UDF to reproduce the issue; is this a config issue? UDF: package HiveUDF.HIVEUDF; import org.apache.hadoop.hive.ql.exec.UDF; import org.apache
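For comparison, a minimal UDF of this shape that compiles against the old UDF API (the package name mirrors the excerpt above; the class name is illustrative):

```java
// Mirrors the package from the question's excerpt; uppercase packages are
// unconventional in Java but kept here for consistency with the asker's code.
package HiveUDF.HIVEUDF;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Upper extends UDF {
  // With the old UDF API, the class only needs a public evaluate(...) method.
  public Text evaluate(Text input) {
    if (input == null) {
      return null;
    }
    return new Text(input.toString().toUpperCase());
  }
}
```

Registered with something like CREATE FUNCTION to_upper AS 'HiveUDF.HIVEUDF.Upper' USING JAR '<hdfs path to the jar>' (the path is a placeholder), the "return code 1 from FunctionTask" error frequently comes down to the jar not being readable by HiveServer2 or the fully qualified class name not matching the package and class exactly.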

Hadoop Pig map-reduce job fails after completing 33%

I am running a GROUP BY clause in Apache Pig and it is creating a map-reduce job which fails after 1/3 completion. Is there any way I can troubleshoot this, as the logs don't give any reason for the failure? What I am looking for is either of the following: 1. Some way to find what the exact error is (i.e. memory error, datatype error, etc.) 2. Some way to make the logs more verbose and write more error messages to the screen. 2016-04-03 22:59:40,252 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapRedu

Nutch on Hadoop on Google Cloud - Cloud Dataproc

I get the error below when I try to run Nutch on Hadoop on Google Cloud (Dataproc). Any idea why I would be facing this issue? user@cluster-1-m:~/apache-nutch-1.7/build$ hadoop jar /home/user/apache-nutch-1.7/runtime/deploy/apache-nutch-1.7.job org.apache.nutch.crawl.Crawl /tmp/testnutch/input/urls.txt -solr http://SOLRIP:8080/solr/ -depth 5 -topN2 16/09/11 17:57:38 INFO crawl.Crawl: crawl started in: crawl-20160911175737 16/09/11 17:57:38 INFO crawl.Crawl: rootUrlDir = -topN2 16/09/11 17:57

How to use Hadoop with Laravel 5.2

I tried to search, but there are still not many examples out there. Can anyone please point me to a tutorial for Laravel-Hadoop integration? In my development I want both connections, MySQL and Hadoop.

How can I connect to remote Linux nodes on which Hadoop is installed, using an Apache NiFi instance installed on my local Windows box?

I have installed Apache NiFi 1.1.1 on my local Windows system. How can I connect to remote Linux nodes on which Hadoop is installed, using the Apache NiFi instance installed on my local Windows box? Also, how can I perform data migration on those remote Hadoop nodes using this local instance of NiFi? I have enabled Kerberos on the remote Hadoop cluster.

Hadoop Hive table stored as Parquet fails

I am receiving an error while trying to insert data into a Hive table stored as Parquet. I have a table with 103209 rows. It works when I add a LIMIT clause to the SELECT statement. CREATE EXTERNAL TABLE test_parquet (a bigint,b int) STORED AS PARQUET LOCATION 's3n://abc/test_parquet'; insert into test_parquet select a,b from stage_mri_travel limit 100; --- works insert into test_parquet select a,b from stage_mri_travel; --- fails Error: Caused by: java.lang.reflect.InvocationTargetException at sun.reflect

Hadoop How to convert a field into a timestamp in Hive

How do I convert the fourth field into a timestamp? I have loaded the data into a table, but while querying it shows as NULL. 1::1193::5::978300760 My table format: CREATE TABLE `mv`( `uid` INT, `mid` INT, `rating` INT, `tmst` TIMESTAMP) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'field.delim'='::', 'serialization.format'='::') STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.

Hadoop Loading data into Hive Table from HDFS in Cloudera VM

When using the Cloudera VM, how can you access information in HDFS? I know there isn't a direct path to HDFS, but I also don't see how to dynamically access it. After creating a Hive table through the Hive CLI, I attempted to load some data from a file located in HDFS: load data inpath '/test/student.txt' into table student; But then I just get this error: FAILED: SemanticException Line 1:17 Invalid path ''/test/student.txt'': No files matching path hdfs://quickstart.cloudera:8020/

Hadoop - `hdfs dfs -ls` versus ls

I connect to a Hadoop cluster at work using ssh. There seem to be two different file systems available from there: one local (although it's remote, since I'm sshing into this machine), which I can navigate using cd and list files in using ls, and where I can also install some programs; and one that is not local, accessed using Hadoop commands (hdfs dfs ...). I don't understand how these two file systems work together. Is the local one kind of the master node of the Hadoop cluster, from which I can execute

Hadoop How to change the HDFS block size of a DataFrame in pyspark

This seems related to How to change hdfs block size in pyspark? I can successfully change the HDFS block size with rdd.saveAsTextFile, but not with the corresponding DataFrame.write.parquet, and I am unable to save in Parquet format. I am unsure whether it's a bug in the pyspark DataFrame or whether I did not set the configuration correctly. The following is my testing code: ########## # init ########## from pyspark import SparkContext, SparkConf from pyspark.sql import SparkSession import hdfs from hdfs impor

Hadoop How long should it take to process 15GB of logfile data on Amazon EMR?

I have a cluster with 1 Master (m4.large), 6 Core (m4.large), and 4 Task (m4.large) nodes. The 15GB of cloudfront log data splits into 35 mappers and 64 reducers. Currently, it's taking more than 30 minutes to process fully -- too long for my purposes, so I stop the job to reconfigure. How long would I expect the processing to take with this setup? What would be a reasonable resizing to get the job to run in under 15 minutes?

What is the difference between the two downloadable versions of Giraph 1.2: giraph-dist-1.2.0-hadoop2-bin.tar.gz and giraph-dist-1.2.0-bin.tar.gz?

What is the difference between giraph-dist-1.2.0-hadoop2-bin.tar.gz and giraph-dist-1.2.0-bin.tar.gz. Is there any documentation about that? The only documentation that I found is the following one: Apache Hadoop 2 (latest version: 2.5.1) This is the latest version of Hadoop 2 (supporting YARN in addition to MapReduce) Giraph could use. You may tell maven to use this version with "mvn -Phadoop_2 ".

Hadoop Update/Edit records in HDFS using Hive

I have some records of people in HDFS. I use an external table in Hive to view and analyze that particular data, and I can also use it externally in other programs. Recently I got a use case where I have to update the data in HDFS. As per the documentation, I learned that we can't update or delete data using an external table. Another problem is that the data is not in ORC format; it is actually in TEXTFILE format, so I am unable to update or delete data in an internal table either. As it is in producti

Hadoop How to send Beeline output to Sqoop

I am struggling to send Beeline output to the Apache Sqoop tool. I understand that Apache Sqoop can read data from where it sits on the Hadoop cluster, but Beeline queries data and writes the output to wherever the Hadoop client is running. Is it possible to send Beeline output directly to the Hadoop cluster, or to instruct Apache Sqoop to read data from a machine where the Hadoop client is not installed?

Hadoop Unable to create a table from Hive CLI - ERROR 23502

I seem to be getting the exception below when I try to create a table using the Hive client. create table if not exists test (id int, name string) comment 'test table'; 11:15:32.016 [HiveServer2-Background-Pool: Thread-34] ERROR org.apache.hadoop.hive.metastore.RetryingHMSHandler - Retrying HMSHandler after 2000 ms (attempt 1 of 10) with error: javax.jdo.JDODataStoreException: Insert of object "org.apache.hadoop.hive.metastore.model.MTable@784fafc2" using statement "INSERT INTO TBLS (TBL_ID,CR

How does Hadoop deal with files with no key-value structure

I am new to Hadoop and I am learning the MapReduce paradigm. In the tutorial I am following, it is said that the map-reduce approach applies two operations (map and reduce) based on the key-value pairs of the file. I know that Hadoop also deals with unstructured data, so I was wondering how it would handle map-reduce in the case of unstructured data.
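Concretely, for plain text the framework imposes the key-value structure itself: TextInputFormat hands the mapper (byte offset, line) pairs, and the mapper decides which keys and values to emit. A small illustrative sketch follows; the class name and the length-counting logic are made up purely for demonstration.

```java
// Sketch: TextInputFormat turns an unstructured text file into (offset, line)
// pairs; the mapper invents the real keys. Here we emit (wordLength, 1) just to
// show that the key-value structure comes from the mapper, not from the file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UnstructuredTextMapper
    extends Mapper<LongWritable, Text, IntWritable, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final IntWritable length = new IntWritable();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // 'offset' is just the byte position of the line in the file; it is the
    // framework-supplied key, not something present in the data itself.
    for (String token : line.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        length.set(token.length());
        context.write(length, ONE);
      }
    }
  }
}
```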
