1. What is scrapy_splash?

scrapy_splash is a component of scrapy.
2. The role of scrapy_splash

scrapy-splash can simulate a browser to load the JS in a page and return the data after the JS has run.

3. Environment installation of scrapy_splash

3.1 Using the splash docker image
The splash dependency environment is fairly complicated, so we can use the splash Docker image directly. If you do not use the Docker image, refer to the official splash documentation to install the corresponding dependencies.

3.1.1 Install and start the docker service
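The install commands are not given in the original; a minimal sketch for an Ubuntu-like system (the docker.io package name and the service command are assumptions that vary by distribution):

sudo apt-get install docker.io   # install Docker (package name assumed; varies by distribution)
sudo service docker start        # start the Docker daemon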
3.1.2 Get the splash image
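For example, using scrapinghub/splash, the image name used in the splash documentation:

sudo docker pull scrapinghub/splash   # download the splash image from Docker Hub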
3.1.3 Verify that the installation is successful
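Start the splash container first; a minimal sketch that publishes splash's default HTTP port 8050 on the host:

sudo docker run -p 8050:8050 scrapinghub/splash   # run splash and expose its HTTP API on port 8050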
Visit http://127.0.0.1:8050 in a browser; if the splash page is displayed, the installation was successful.

3.1.4 Solve the problem of image acquisition timeout: modify the docker image source
1. Create and edit the docker configuration file
2. Write the mirror address of the domestic docker-cn.com registry into the configuration, then save and exit:

{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}

3. Restart the computer or the docker service, then pull the splash image again (see the command sketch after this list).

4. If it is still slow, use your mobile hotspot (mobile data, orz).
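A command sketch for the steps above, assuming the standard daemon configuration path /etc/docker/daemon.json (the path is not given in the original) and a service named docker:

sudo mkdir -p /etc/docker             # make sure the configuration directory exists
sudo vi /etc/docker/daemon.json       # write the registry-mirrors JSON shown above, save and exit
sudo service docker restart           # restart the docker service so the mirror takes effect
sudo docker pull scrapinghub/splash   # re-pull the splash image through the mirror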
3.1.5 Stop the splash service

sudo docker ps -a               # list containers to find the splash CONTAINER_ID
sudo docker stop CONTAINER_ID   # stop the splash container
sudo docker rm CONTAINER_ID     # remove the container

3.2 Install the scrapy-splash package in the Python virtual environment
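With the virtual environment activated, the package installs from PyPI:

pip install scrapy-splash   # install the scrapy-splash package into the virtual environment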
4. Using splash in scrapy
4.1 Create a project and create the crawlers

scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com

4.2 Improve the settings.py configuration file

Add the splash configuration to settings.py:

# Rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Do not obey robots.txt rules
ROBOTSTXT_OBEY = False

4.3 Without splash

Improve the spider in spiders/no_splash.py:

import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())

4.4 Using splash

Improve the spider in spiders/with_splash.py:

import scrapy
from scrapy_splash import SplashRequest  # use the request object provided by the scrapy_splash package

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},        # maximum timeout, unit: seconds
                            endpoint='render.html')   # fixed endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())

4.5 Run the two crawlers and observe the results

4.5.1 Run the two crawlers separately

scrapy crawl no_splash
scrapy crawl with_splash

4.5.2 Observe the two HTML files obtained

No splash (screenshot)

Using splash (screenshot)

4.6 Conclusion

The spider that goes through splash obtains the page content after the JS has been rendered, while the spider without splash only obtains the original, un-rendered source.
5. Learn more

For more splash usage, refer to the official splash documentation.
6. Summary

1. The role of the scrapy_splash component: it simulates a browser to load JS and returns the data after the JS has run.
2. Use of the scrapy_splash component: yield a SplashRequest instead of a plain scrapy.Request, with a callback, args={'wait': ...} and endpoint='render.html'.
3. Specific configuration of scrapy_splash:

SPLASH_URL = 'http://127.0.0.1:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

This is the end of this article about advanced crawlers: using the Scrapy_splash component for automatic JS rendering. For more on using the Scrapy_splash component, please search for previous articles on 123WORDPRESS.COM. We hope you will continue to support 123WORDPRESS.COM!