1. What is scrapy_splash?

scrapy_splash is a component for scrapy (installed as a separate package) that sends requests through the Splash rendering service.
2. The role of scrapy_splash

scrapy-splash can simulate a browser to load a page's JavaScript and return the data produced after the JavaScript has run.

3. Environment installation of scrapy_splash

3.1 Using the splash Docker image
Splash's dependency environment is somewhat complicated to set up by hand, so we can use the splash Docker image directly. If you do not want to use the Docker image, refer to the official splash documentation to install the required dependencies manually.

3.1.1 Install and start the Docker service
3.1.2 Get the splash image
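The command for this step is not shown in the text; pulling the official splash image from Docker Hub is typically done as:

```shell
# Pull the official Splash image published on Docker Hub
sudo docker pull scrapinghub/splash
```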
3.1.3 Verify that the installation is successful
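Before verifying, the Splash container has to be running; a typical invocation (the port 8050 mapping matches the SPLASH_URL used later in settings.py) is:

```shell
# Start Splash and expose its HTTP API on the default port 8050
sudo docker run -p 8050:8050 scrapinghub/splash
```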
Visit http://127.0.0.1:8050 in a browser; if the Splash welcome page loads, the installation succeeded.

3.1.4 Solve the problem of image download timeouts: change the Docker registry mirror
1. Create and edit the docker configuration file
2. Write the docker-cn.com registry mirror address into the configuration, then save and exit:

```json
{
  "registry-mirrors": ["https://registry.docker-cn.com"]
}
```

3. Restart the computer or the Docker service, then pull the splash image again.
4. If it is still slow, fall back to a mobile hotspot.

3.1.5 Stopping the splash service
```shell
sudo docker ps -a
sudo docker stop CONTAINER_ID
sudo docker rm CONTAINER_ID
```

3.2 Install the scrapy-splash package in the Python virtual environment
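The install command itself is missing from the text; with the virtual environment activated, it is simply:

```shell
# Install the scrapy-splash integration package from PyPI
pip install scrapy-splash
```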
4. Using splash in scrapy
4.1 Create a project and create the crawlers

```shell
scrapy startproject test_splash
cd test_splash
scrapy genspider no_splash baidu.com
scrapy genspider with_splash baidu.com
```

4.2 Improve the settings.py configuration file

Add the splash configuration in settings.py:

```python
# Rendering service url
SPLASH_URL = 'http://127.0.0.1:8050'
# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# Deduplication filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Use Splash's HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```

4.3 Without splash

Edit no_splash.py:

```python
import scrapy

class NoSplashSpider(scrapy.Spider):
    name = 'no_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def parse(self, response):
        with open('no_splash.html', 'w') as f:
            f.write(response.body.decode())
```

4.4 Using splash

Edit with_splash.py:

```python
import scrapy
from scrapy_splash import SplashRequest  # the request object provided by the scrapy_splash package

class WithSplashSpider(scrapy.Spider):
    name = 'with_splash'
    allowed_domains = ['baidu.com']
    start_urls = ['https://www.baidu.com/s?wd=13161933309']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0],
                            callback=self.parse_splash,
                            args={'wait': 10},  # maximum wait time, in seconds
                            endpoint='render.html')  # a fixed endpoint of the splash service

    def parse_splash(self, response):
        with open('with_splash.html', 'w') as f:
            f.write(response.body.decode())
```

4.5 Run the two crawlers and observe the results

4.5.1 Run the two crawlers separately

```shell
scrapy crawl no_splash
scrapy crawl with_splash
```

4.5.2 Compare the two HTML files that were saved

(The original article shows two screenshots here: the page fetched without splash and the page fetched with splash.)

4.6 Conclusion

The file saved by the with_splash spider contains the content rendered by JavaScript, while the file saved by the no_splash spider does not.
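To make the render.html endpoint used above more concrete, here is a small, self-contained sketch of the kind of HTTP call Splash serves for the spider. The splash_render_url helper is purely illustrative (it is not part of scrapy_splash); it shows how the target URL and the wait argument end up as query parameters of the rendering service.

```python
from urllib.parse import urlencode

SPLASH_URL = 'http://127.0.0.1:8050'  # same value as in settings.py

def splash_render_url(target_url, wait=10, endpoint='render.html'):
    """Build the GET form of a Splash render.html call (illustrative helper)."""
    query = urlencode({'url': target_url, 'wait': wait})
    return f'{SPLASH_URL}/{endpoint}?{query}'

# The page the with_splash spider asks Splash to render:
print(splash_render_url('https://www.baidu.com/s?wd=13161933309'))
```

Opening such a URL in a browser while the Splash container is running returns the rendered HTML directly, which is a convenient way to experiment with wait values before wiring them into SplashRequest.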
5. Learn more
6. Summary

1. The role of the scrapy_splash component
2. Use of scrapy_splash component
3. Specific configuration of scrapy_splash:

```python
SPLASH_URL = 'http://127.0.0.1:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```