Install

First, install Java and Scala, then download Spark. Make sure PATH and JAVA_HOME are set, and build Spark with SBT:

```
$ sbt/sbt assembly
```

The build takes a while. Once it completes, verify that the installation was successful by running the Spark shell:

```
$ ./bin/spark-shell
scala> val textFile = sc.textFile("README.md") // Create a reference to README.md
scala> textFile.count // Count the number of lines in this file
scala> textFile.first // Print the first line
```

Apache Access Log Analyzer

First we need a Scala analyzer for Apache access logs. Fortunately, one has already been written: download the Apache logfile parser code, then compile and package it with SBT:

```
sbt compile
sbt test
sbt package
```

The resulting JAR is assumed to be named AlsApacheLogParser.jar. Add it to the Spark shell's classpath:

```
// this works
$ MASTER=local[4] SPARK_CLASSPATH=AlsApacheLogParser.jar ./bin/spark-shell
```

In Spark 0.9, the following alternatives do not work:

```
// does not work
$ MASTER=local[4] ADD_JARS=AlsApacheLogParser.jar ./bin/spark-shell

// does not work
spark> :cp AlsApacheLogParser.jar
```

Once the JAR is loaded, create an AccessLogParser instance in the Spark REPL:

```
import com.alvinalexander.accesslogparser._
val p = new AccessLogParser
```

Now you can read the Apache access log accesslog.small just as we read README.md above:

```
scala> val log = sc.textFile("accesslog.small")
14/03/09 11:25:23 INFO MemoryStore: ensureFreeSpace(32856) called with curMem=0, maxMem=309225062
14/03/09 11:25:23 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 32.1 KB, free 294.9 MB)
log: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:15

scala> log.count
(a lot of output here)
res0: Long = 100000
```

Analyzing Apache logs

We can now count how many 404 responses appear in the Apache log.
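To make the parsing step concrete, here is a minimal sketch of what parsing one Apache log line into a record looks like. This is not the internals of AlsApacheLogParser; the `LogRecord` fields and the regex here are illustrative assumptions covering only a few fields of the common log format.

```scala
// Minimal sketch: parse one Apache log line with a regex.
// NOT the real AlsApacheLogParser; ip/request/status are assumed field names.
case class LogRecord(ip: String, request: String, status: String)

val logPattern = """(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" (\d{3}) \S+""".r

// Unparseable lines yield None, mirroring how the real parser signals failure
def parseLine(line: String): Option[LogRecord] = line match {
  case logPattern(ip, request, status) => Some(LogRecord(ip, request, status))
  case _                               => None
}

val sample =
  """66.249.11.65 - - [09/Mar/2014:11:25:23 -0700] "GET /blog/post HTTP/1.1" 404 970"""

println(parseLine(sample))
```

A real parser handles many more fields (referrer, user agent, byte counts), which is why reusing the existing library is preferable to writing your own regex.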
Create the following method:

```
def getStatusCode(line: Option[AccessLogRecord]) = {
  line match {
    case Some(l) => l.httpStatusCode
    case None => "0"
  }
}
```

`Option[AccessLogRecord]` is the parser's return type. Then use the method at the Spark command line:

```
log.filter(line => getStatusCode(p.parseRecord(line)) == "404").count
```

This returns the number of lines whose httpStatusCode is 404.

Digging Deeper

Now suppose we want to know which URLs are problematic, for example a URL containing a space that causes a 404 error. The following steps are required:
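The filter/count expression above can be sketched without a Spark cluster by running the same logic over a plain Scala `List` standing in for the RDD. The `AccessLogRecord` and `parseRecord` below are simplified stand-ins, not the real library's API.

```scala
// Stand-in for the library's record type, reduced to one field
case class AccessLogRecord(httpStatusCode: String)

// Toy parser: assume the third whitespace-separated field is the status code
def parseRecord(line: String): Option[AccessLogRecord] = {
  val fields = line.split(" ")
  if (fields.length >= 3) Some(AccessLogRecord(fields(2))) else None
}

// Same helper as in the article: unparseable lines map to status "0"
def getStatusCode(line: Option[AccessLogRecord]) = line match {
  case Some(l) => l.httpStatusCode
  case None    => "0"
}

val log = List(
  "GET /index.html 200",
  "GET /missing.html 404",
  "GET /gone.html 404"
)

// Same shape as the Spark expression, with List.filter in place of RDD.filter
val notFound = log.filter(line => getStatusCode(parseRecord(line)) == "404").size
println(notFound) // → 2
```

Because `filter` has the same signature on `List` and on `RDD[String]`, logic prototyped this way transfers to the Spark REPL unchanged.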
Create the following method:

```
// get the `request` field from an access log record
def getRequest(rawAccessLogString: String): Option[String] = {
  val accessLogRecordOption = p.parseRecord(rawAccessLogString)
  accessLogRecordOption match {
    case Some(rec) => Some(rec.request)
    case None => None
  }
}
```

Paste this code into the Spark REPL and run the following:

```
log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).count
val recs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_))
val distinctRecs = log.filter(line => getStatusCode(p.parseRecord(line)) == "404").map(getRequest(_)).distinct
distinctRecs.foreach(println)
```

Summary

For simple analysis of access logs, grep is of course the better choice, but more complex queries require Spark. It is difficult to judge Spark's performance on a single machine, because Spark is designed for processing large files on distributed systems.
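As a closing aside, the 404-URL pipeline from the Digging Deeper section can also be sketched with plain Scala collections, which is handy for testing the helpers before pasting them into the REPL. The toy line format and record fields below are assumptions, not the real AlsApacheLogParser API.

```scala
// Stand-in record with only the two fields the pipeline needs
case class AccessLogRecord(request: String, httpStatusCode: String)

// Toy line format: "<request> <status>"; the last space separates the two
def parseRecord(line: String): Option[AccessLogRecord] = {
  val idx = line.lastIndexOf(' ')
  if (idx > 0) Some(AccessLogRecord(line.take(idx), line.drop(idx + 1)))
  else None
}

def getStatusCode(rec: Option[AccessLogRecord]): String =
  rec.map(_.httpStatusCode).getOrElse("0")

def getRequest(line: String): Option[String] =
  parseRecord(line).map(_.request)

val log = List(
  "GET /ok.html HTTP/1.1 200",
  "GET /a b.html HTTP/1.1 404", // URL with a space, repeated
  "GET /a b.html HTTP/1.1 404",
  "GET /gone.html HTTP/1.1 404"
)

// Same filter/map/distinct shape as the Spark pipeline above
val distinctRecs = log
  .filter(line => getStatusCode(parseRecord(line)) == "404")
  .map(getRequest(_))
  .distinct

distinctRecs.foreach(println) // two distinct problem requests
```

The duplicate "/a b.html" request collapses under `distinct`, leaving one entry per problematic URL.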