Advanced crawler - Use of Scrapy_splash component for JS automatic rendering

1. What is scrapy_splash?

scrapy_splash is a component (plugin) for Scrapy.

  • scrapy-splash loads JS-generated data by relying on Splash.
  • Splash is a JavaScript rendering service: a lightweight browser that exposes an HTTP API. Splash is implemented in Python and Lua and is built on modules such as Twisted and QT.
  • The response obtained through scrapy-splash is equivalent to the page source after a browser has finished rendering.

Splash official documentation https://splash.readthedocs.io/en/stable/
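Because Splash exposes an HTTP API, it can also be called without Scrapy at all. Below is a minimal sketch, assuming a Splash instance is already listening on http://127.0.0.1:8050 (see section 3 for installation) and that the requests package is installed; the target URL is just an example.

import requests

# Ask Splash's render.html endpoint for the page source after the JS has run
params = {
    'url': 'https://www.baidu.com',  # page to render
    'wait': 2,                       # seconds to let the page's JavaScript run
}
response = requests.get('http://127.0.0.1:8050/render.html', params=params)
print(response.text[:200])  # rendered HTML, as a browser would see it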

2. The role of scrapy_splash

scrapy-splash can simulate a browser loading JavaScript and return the data produced after the JS has run.

3. Environment installation of scrapy_splash

3.1 Using the splash docker image

Splash's Dockerfile https://github.com/scrapinghub/splash/blob/master/Dockerfile

As you can see, Splash's dependency environment is fairly complicated, so we can simply use the Splash Docker image instead.

If you do not use the Docker image, please refer to the official splash documentation to install the corresponding dependency environment

3.1.1 Install and start the docker service

Installation reference https://www.jb51.net/article/213611.htm

3.1.2 Get the splash image

With Docker installed correctly, pull the Splash image:

sudo docker pull scrapinghub/splash

3.1.3 Verify that the installation is successful

Run the splash docker service and access port 8050 through the browser to verify whether the installation is successful

  • Run in the foreground: sudo docker run -p 8050:8050 scrapinghub/splash
  • Run in the background: sudo docker run -d -p 8050:8050 scrapinghub/splash

Visit http://127.0.0.1:8050; if you see the page shown in the screenshot below, the installation was successful.

[Screenshot: the Splash welcome page at http://127.0.0.1:8050]
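The rendering endpoint can also be checked from the command line; a quick sanity test, assuming curl is available on the host:

curl "http://127.0.0.1:8050/render.html?url=https://www.baidu.com&wait=2"

If Splash is running, this prints the rendered HTML of the target page.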

3.1.4 Solving image pull timeouts: change the Docker registry mirror

Take Ubuntu 18.04 as an example

1. Create and edit the docker configuration file

sudo vi /etc/docker/daemon.json

2. Write the registry mirror configuration for the domestic docker-cn.com mirror, then save and exit

{ 
"registry-mirrors": ["https://registry.docker-cn.com"] 
}

3. Restart the computer or the Docker service and pull the splash image again (commands shown after this list)

4. If pulling is still slow, use a mobile hotspot instead (at the cost of your mobile data, orz)
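For step 3 above, on Ubuntu 18.04 the restart and re-pull would typically look like this (a systemd-based setup is assumed):

sudo systemctl daemon-reload
sudo systemctl restart docker
sudo docker pull scrapinghub/splash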

3.1.5 Stopping the splash service

You need to stop the container first, then remove it.

sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID

3.2 Install the scrapy-splash package in the Python virtual environment

pip install scrapy-splash

4. Using splash in scrapy

Take Baidu as an example

4.1 Create a project and create a crawler

scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com

4.2 Improve the settings.py configuration file

Add the splash configuration to the settings.py file and modify the robots protocol setting:

# Rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

4.3 Without splash

Complete the code in spiders/no_splash.py:

import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())

4.4 Using splash

Complete the code in spiders/with_splash.py:

import scrapy
from scrapy_splash import SplashRequest  # the request object provided by the scrapy_splash package

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},        # maximum wait time, in seconds
                            endpoint='render.html')   # fixed rendering endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())

4.5 Run two crawlers separately and observe the phenomenon

4.5.1 Run two crawlers separately

scrapy crawl no_splash
scrapy crawl with_splash

4.5.2 Observe the two HTML files obtained

Without splash

[Screenshot: contents of no_splash.html, the page source without JS rendering]

Using splash

[Screenshot: contents of with_splash.html, the page source after JS rendering]

4.6 Conclusion

  • Splash, like Selenium, can visit the URL in the request object the way a browser would
  • It can then send further requests in order, based on the response content of each URL
  • It renders the multiple response contents corresponding to those multiple requests
  • Finally, it returns the rendered response object

5. Learn more

About splash https://www.jb51.net/article/219166.htm

About scrapy_splash (screenshot, get_cookies, etc.) https://www.e-learn.cn/content/qita/800748
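As a taste of those extra features, here is a minimal sketch of a spider that asks Splash for a screenshot through the render.png endpoint and for cookies through a small Lua script on the execute endpoint. The spider name, target URL and output file names are illustrative only, and the same settings.py configuration as in section 4.2 is assumed.

import json

import scrapy
from scrapy_splash import SplashRequest

# Lua script for the 'execute' endpoint: load the page, wait for the JS,
# then return both the rendered HTML and the cookies Splash collected.
LUA_SOURCE = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
"""

class SplashExtrasSpider(scrapy.Spider):  # illustrative spider name
    name = 'splash_extras'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/']

    def start_requests(self):
        # Screenshot of the rendered page (PNG bytes arrive in response.body)
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_png,
                            endpoint='render.png',
                            args={'wait': 2, 'width': 1024})
        # Run the Lua script above; its return table arrives as response.data
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_execute,
                            endpoint='execute',
                            args={'lua_source': LUA_SOURCE, 'wait': 2})

    def parse_png(self, response):
        with open('screenshot.png', 'wb') as f:
            f.write(response.body)

    def parse_execute(self, response):
        # scrapy_splash exposes the decoded JSON result of the script as response.data
        with open('cookies.json', 'w') as f:
            json.dump(response.data.get('cookies', []), f)

Running it with scrapy crawl splash_extras should produce screenshot.png and cookies.json in the project directory.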

6. Summary

1. The role of scrapy_splash component

  • Splash, like Selenium, can visit the URL in the request object the way a browser would
  • It can then send further requests in order, based on the response content of each URL
  • It renders the multiple response contents corresponding to those multiple requests
  • Finally, it returns the rendered response object

2. Use of scrapy_splash component

  • The splash service is required as support
  • The constructed request object becomes scrapy_splash.SplashRequest
  • It works through downloader middleware
  • scrapy_splash requires its own specific configuration

3. Specific configuration of scrapy_splash

SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

This is the end of this article on advanced crawlers and the use of the Scrapy_splash component for JS automatic rendering. For more content on the Scrapy_splash component, please search for previous articles on 123WORDPRESS.COM. I hope everyone will continue to support 123WORDPRESS.COM!
