Advanced crawler - Use of Scrapy_splash component for JS automatic rendering

1. What is scrapy_splash?

scrapy_splash is a third-party component for Scrapy

  • scrapy-splash relies on Splash to load JS data.
  • Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. It is implemented in Python and Lua, and is built on modules such as Twisted and Qt.
  • The response obtained through scrapy-splash is equivalent to the page source after the browser has finished rendering it.

Splash official documentation https://splash.readthedocs.io/en/stable/

2. The role of scrapy_splash

scrapy-splash can simulate a browser to load JS and return the data produced after the JS has run

3. Environment installation of scrapy_splash

3.1 Using the splash docker image

Splash's Dockerfile https://github.com/scrapinghub/splash/blob/master/Dockerfile

As the Dockerfile shows, Splash's dependency environment is fairly involved, so it is easiest to use the Splash Docker image directly

If you do not use the Docker image, please refer to the official splash documentation to install the corresponding dependency environment

3.1.1 Install and start the docker service

Installation reference https://www.jb51.net/article/213611.htm

3.1.2 Get the splash image

With Docker installed correctly, pull the splash image

sudo docker pull scrapinghub/splash

3.1.3 Verify that the installation is successful

Run the splash docker service and access port 8050 through the browser to verify whether the installation is successful

  • Run in the foreground: sudo docker run -p 8050:8050 scrapinghub/splash
  • Run in the background: sudo docker run -d -p 8050:8050 scrapinghub/splash

Visit http://127.0.0.1:8050; if the Splash welcome page loads, the installation succeeded


3.1.4 Solve the problem of image acquisition timeout: modify the docker image source

Take Ubuntu 18.04 as an example

1. Create and edit the docker configuration file

sudo vi /etc/docker/daemon.json

2. Add the registry mirror address of docker-cn.com (a Docker registry mirror hosted in China), then save and exit

{ 
"registry-mirrors": ["https://registry.docker-cn.com"] 
}

3. Restart the Docker service (e.g. sudo systemctl restart docker) or the computer, then pull the splash image again

4. If the pull is still slow, try a different network, such as a mobile hotspot
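A malformed daemon.json will keep Docker from starting, so it is worth validating the file before restarting the service. A minimal stdlib-only check (the config string mirrors the one written in step 2):

```python
import json

# The mirror configuration written to /etc/docker/daemon.json in step 2
config_text = '{ "registry-mirrors": ["https://registry.docker-cn.com"] }'

# json.loads raises json.JSONDecodeError if the file content is malformed
config = json.loads(config_text)
print(config["registry-mirrors"])
```

The same check can be pointed at the real file by reading /etc/docker/daemon.json before restarting Docker.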

3.1.5 Stop the splash service

Stop the container first, then remove it.

sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID

3.2 Install the scrapy-splash package in the Python virtual environment

pip install scrapy-splash

4. Using splash in scrapy

Take Baidu as an example

4.1 Create a project and create a crawler

scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com

4.2 Improve the settings.py configuration file

Add splash configuration in settings.py file and modify robots protocol

# Rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False
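For intuition about what these settings do: SplashMiddleware reroutes each SplashRequest through SPLASH_URL instead of fetching the target page directly. The sketch below (stdlib only, and simplified: the real middleware POSTs JSON to the endpoint rather than building a GET URL) shows how a target URL and the wait argument map onto Splash's render.html endpoint:

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://127.0.0.1:8050'

def splash_render_url(url, wait=0):
    """Build the render.html URL that would ask Splash to render a page."""
    # render.html accepts the target url and a wait time (seconds)
    params = urlencode({'url': url, 'wait': wait})
    return f'{SPLASH_URL}/render.html?{params}'

render_url = splash_render_url('https://www.baidu.com', wait=10)
print(render_url)
```

Opening such a URL in a browser (with the Splash container running) returns the rendered HTML of the target page.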

4.3 No splash

Complete spiders/no_splash.py

import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())

4.4 Using splash

import scrapy
# Use the request object provided by the scrapy_splash package
from scrapy_splash import SplashRequest

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},  # maximum wait time, in seconds
                            endpoint='render.html')  # fixed endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())
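render.html covers the common case; for finer control (custom waits, clicks, scrolling) scrapy-splash also supports the execute endpoint driven by a Lua script. A sketch of the arguments involved, with a hypothetical minimal script (actually running it requires the scrapy_splash package and a live Splash service):

```python
# Minimal Lua script: load the page, wait, and return the rendered HTML
lua_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(args.wait)
    return splash:html()
end
"""

# These args would be passed as:
#   yield SplashRequest(url, callback=self.parse_splash,
#                       endpoint='execute', args=splash_args)
splash_args = {'lua_source': lua_script, 'wait': 10}
print(sorted(splash_args))
```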

4.5 Run two crawlers separately and observe the phenomenon

4.5.1 Run two crawlers separately

scrapy crawl no_splash
scrapy crawl with_splash

4.5.2 Observe the two HTML files obtained

No splash: no_splash.html contains the page source before any JS has run.

Using splash: with_splash.html contains the fully rendered page after the JS has run.

4.6 Conclusion

  • Like selenium, splash can open the URL in the request object as a browser would
  • It can then send follow-up requests in sequence, based on the content of each response
  • It renders the response content of each of those requests
  • Finally, it returns the fully rendered response object

5. Learn more

About splash https://www.jb51.net/article/219166.htm

About scrapy_splash (screenshot, get_cookies, etc.) https://www.e-learn.cn/content/qita/800748

6. Summary

1. The role of scrapy_splash component

  • Like selenium, splash can open the URL in the request object as a browser would
  • It can then send follow-up requests in sequence, based on the content of each response
  • It renders the response content of each of those requests
  • Finally, it returns the fully rendered response object

2. Use of scrapy_splash component

  • Requires a running Splash service as support
  • Requests are constructed as scrapy_splash.SplashRequest objects
  • It works as downloader middleware
  • It needs scrapy_splash-specific configuration in settings.py

3. Specific configuration of scrapy_splash

SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
