Data Cleaning (ETL)

Before running the core business MapReduce program, it is often necessary to clean the data first to remove records that do not meet the requirements. The cleaning process usually only requires running a Mapper program, not a Reducer.

1. Requirement

Remove log lines whose number of fields is less than or equal to 11.

(1) Input data: web.log
(2) Expected output: every line has more than 11 fields

2. Requirement Analysis

The input data needs to be filtered and cleaned according to the rule in the Map stage.

3. Implementation

(1) Write the LogMapper class

```java
package com.atguigu.mapreduce.weblog;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LogMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    Text k = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 Get one line of data
        String line = value.toString();

        // 2 Parse the log line
        boolean result = parseLog(line, context);

        // 3 Skip illegal lines
        if (!result) {
            return;
        }

        // 4 Set the key
        k.set(line);

        // 5 Write out the data
        context.write(k, NullWritable.get());
    }

    // Parse the log: a line is legal only if it has more than 11 fields
    private boolean parseLog(String line, Context context) {
        // 1 Split the line into fields
        String[] fields = line.split(" ");

        // 2 Lines with more than 11 fields are legal
        if (fields.length > 11) {
            // System counter
            context.getCounter("map", "true").increment(1);
            return true;
        } else {
            context.getCounter("map", "false").increment(1);
            return false;
        }
    }
}
```

(2) Write the LogDriver class

```java
package com.atguigu.mapreduce.weblog;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LogDriver {

    public static void main(String[] args) throws Exception {
        // The input and output paths need to be set according to the actual paths on your machine
        args = new String[] { "e:/input/inputlog", "e:/output1" };

        // 1 Get job information
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Load the jar package
        job.setJarByClass(LogDriver.class);

        // 3 Associate the Mapper
        job.setMapperClass(LogMapper.class);

        // 4 Set the final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Set the number of reduce tasks to 0 (map-only job, no shuffle or reduce phase)
        job.setNumReduceTasks(0);

        // 5 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 6 Submit the job
        job.waitForCompletion(true);
    }
}
```

Summary

The above is the full content of this article. I hope it has some reference value for your study or work. Thank you for your support of 123WORDPRESS.COM.
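The filtering rule at the heart of the Mapper can also be tried out on its own, without a Hadoop cluster. The following is a minimal standalone sketch (the class name `LogFilterDemo` and the sample lines are made up for illustration); it mirrors the `fields.length > 11` check from `parseLog` using a plain string split.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LogFilterDemo {

    // Mirrors parseLog(): a line is kept only if it splits into more than 11 space-separated fields
    static boolean isValid(String line) {
        return line.split(" ").length > 11;
    }

    public static void main(String[] args) {
        // Hypothetical sample lines: the first has 12 fields, the second only 3
        List<String> lines = Arrays.asList(
                "a b c d e f g h i j k l",
                "a b c");

        // Keep only the lines that pass the length check, as the Mapper would
        List<String> kept = lines.stream()
                .filter(LogFilterDemo::isValid)
                .collect(Collectors.toList());

        System.out.println(kept.size()); // prints 1
    }
}
```

Note that `split(" ")` counts fields separated by single spaces, so the same rule applies here as in the MapReduce job: a line with 12 or more fields is legal, anything shorter is dropped.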