Advanced crawler - Use of Scrapy_splash component for JS automatic rendering

1. What is scrapy_splash?

scrapy_splash is a component (plugin) for Scrapy.

  • scrapy-splash loads JS-generated data by relying on Splash.
  • Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. Splash is implemented in Python and Lua and is built on modules such as Twisted and QT.
  • The response obtained through scrapy-splash is equivalent to the page source after a browser has finished rendering.

Splash official documentation https://splash.readthedocs.io/en/stable/
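Because Splash exposes an HTTP API, it can also be called without Scrapy at all. Below is a minimal sketch, assuming a Splash instance is already listening on http://127.0.0.1:8050 (see section 3 for installation) and that the requests package is installed; the target URL is just an example.

import requests

# Ask Splash's render.html endpoint for the page source after the JS has run
params = {
    'url': 'https://www.baidu.com',  # page to render
    'wait': 2,                       # seconds to let the page's JavaScript run
}
response = requests.get('http://127.0.0.1:8050/render.html', params=params)
print(response.text[:200])  # rendered HTML, as a browser would see it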

2. The role of scrapy_splash

scrapy-splash can simulate a browser loading JavaScript and return the data produced after the JS has run.

3. Environment installation of scrapy_splash

3.1 Using the splash docker image

Splash's Dockerfile https://github.com/scrapinghub/splash/blob/master/Dockerfile

As you can see, Splash's dependency environment is fairly complicated, so we can simply use the Splash Docker image instead.

If you do not use the Docker image, please refer to the official splash documentation to install the corresponding dependency environment

3.1.1 Install and start the docker service

Installation reference https://www.jb51.net/article/213611.htm

3.1.2 Get the splash image

With Docker installed correctly, pull the Splash image:

sudo docker pull scrapinghub/splash

3.1.3 Verify that the installation is successful

Run the splash docker service and access port 8050 through the browser to verify whether the installation is successful

  • Run in the foreground: sudo docker run -p 8050:8050 scrapinghub/splash
  • Run in the background: sudo docker run -d -p 8050:8050 scrapinghub/splash

Visit http://127.0.0.1:8050; if you see the page shown in the screenshot below, the installation was successful.

[Screenshot: the Splash welcome page at http://127.0.0.1:8050]
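The rendering endpoint can also be checked from the command line; a quick sanity test, assuming curl is available on the host:

curl "http://127.0.0.1:8050/render.html?url=https://www.baidu.com&wait=2"

If Splash is running, this prints the rendered HTML of the target page.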

3.1.4 Solving image pull timeouts: change the Docker registry mirror

Take Ubuntu 18.04 as an example

1. Create and edit the docker configuration file

sudo vi /etc/docker/daemon.json

2. Write the registry mirror configuration for the domestic docker-cn.com mirror, then save and exit

{ 
"registry-mirrors": ["https://registry.docker-cn.com"] 
}

3. Restart the computer or the Docker service and pull the splash image again (commands shown after this list)

4. If pulling is still slow, use a mobile hotspot instead (at the cost of your mobile data, orz)
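For step 3 above, on Ubuntu 18.04 the restart and re-pull would typically look like this (a systemd-based setup is assumed):

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo docker pull scrapinghub/splash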

3.1.5 Stopping the splash service

You need to stop the container first, then remove it.

sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID

3.2 Install the scrapy-splash package in the Python virtual environment

pip install scrapy-splash

4. Using splash in scrapy

Take Baidu as an example

4.1 Create a project and create a crawler

scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com

4.2 Improve the settings.py configuration file

Add the splash configuration to the settings.py file and modify the robots protocol setting:

# Rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

4.3 Without splash

Complete the code in spiders/no_splash.py:

import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())

4.4 Using splash

Complete the code in spiders/with_splash.py:

import scrapy
from scrapy_splash import SplashRequest  # the request object provided by the scrapy_splash package

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},        # maximum wait time, in seconds
                            endpoint='render.html')   # fixed rendering endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())

4.5 Run two crawlers separately and observe the phenomenon

4.5.1 Run two crawlers separately

scrapy crawl no_splash
scrapy crawl with_splash

4.5.2 Observe the two HTML files obtained

Without splash

[Screenshot: contents of no_splash.html, the page source without JS rendering]

Using splash

[Screenshot: contents of with_splash.html, the page source after JS rendering]

4.6 Conclusion

  • Splash, like Selenium, can visit the URL in the request object the way a browser would
  • It can then send further requests in order, based on the response content of each URL
  • It renders the multiple response contents corresponding to those multiple requests
  • Finally, it returns the rendered response object

5. Learn more

About splash https://www.jb51.net/article/219166.htm

About scrapy_splash (screenshot, get_cookies, etc.) https://www.e-learn.cn/content/qita/800748
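As a taste of those extra features, here is a minimal sketch of a spider that asks Splash for a screenshot through the render.png endpoint and for cookies through a small Lua script on the execute endpoint. The spider name, target URL and output file names are illustrative only, and the same settings.py configuration as in section 4.2 is assumed.

import json

import scrapy
from scrapy_splash import SplashRequest

# Lua script for the 'execute' endpoint: load the page, wait for the JS,
# then return both the rendered HTML and the cookies Splash collected.
LUA_SOURCE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
"""

class SplashExtrasSpider(scrapy.Spider):  # illustrative spider name
    name = 'splash_extras'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/']

    def start_requests(self):
        # Screenshot of the rendered page (PNG bytes arrive in response.body)
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_png,
                            endpoint='render.png',
                            args={'wait': 2, 'width': 1024})
        # Run the Lua script above; its return table arrives as response.data
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_execute,
                            endpoint='execute',
                            args={'lua_source': LUA_SOURCE, 'wait': 2})

    def parse_png(self, response):
        with open('screenshot.png', 'wb') as f:
            f.write(response.body)

    def parse_execute(self, response):
        # scrapy_splash exposes the decoded JSON result of the script as response.data
        with open('cookies.json', 'w') as f:
            json.dump(response.data.get('cookies', []), f)

Running it with scrapy crawl splash_extras should produce screenshot.png and cookies.json in the project directory.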

6. Summary

1. The role of scrapy_splash component

  • Splash, like Selenium, can visit the URL in the request object the way a browser would
  • It can then send further requests in order, based on the response content of each URL
  • It renders the multiple response contents corresponding to those multiple requests
  • Finally, it returns the rendered response object

2. Use of scrapy_splash component

  • The splash service is required as support
  • The constructed request object becomes scrapy_splash.SplashRequest
  • It works through downloader middleware
  • scrapy_splash requires its own specific configuration

3. Specific configuration of scrapy_splash

SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

This is the end of this article on advanced crawlers and the use of the Scrapy_splash component for JS automatic rendering. For more content on the Scrapy_splash component, please search for previous articles on 123WORDPRESS.COM. I hope everyone will continue to support 123WORDPRESS.COM!
