Scrapy - Получение 504 тайм-аута шлюза после первых запросов
Я использовал Scrapy для удаления контента от нас, и теперь я пытаюсь интегрироваться с Splash для запуска Javascript для страниц. Проблема в том, что когда я запускаю сканер, примерно первые 20 запросов возвращают пустое содержимое, а все остальные запросы возвращают код состояния 504. Почему это происходит?
Это файл журнала:
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.875348', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-compressao-adams-termica-ml-821229.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:14 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.940786', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html via http://0.0.0.0:8050/render.html> (referer: None)
2018-06-20 10:43:15 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html
2018-06-20 10:43:15 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:15.298537', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-oxer-netuno-masculino-913399.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/calca-termica-kappa-belquior-masculina-910118.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/jaqueta-oxer-water-repelent-feminina-858050.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/camiseta-do-brasil-2018-crest-nike-masculina-918483.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-fila-infinity-m00kil-mktp.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
Это основной метод для моего паука, чтобы начать утилизацию:
def start_requests(self):
mode = self.settings.get('MODE')
urls = util.get_urls_db(self.custom_settings['URLS_COLLECTION_NAME'])
urls = list(urls)
if mode == 'all':
for url in urls:
yield SplashRequest(url['url'], self.parse_item,
args={
# optional; parameters passed to Splash HTTP API
'timeout': 10,
# 'url' is prefilled from request url
# 'http_method' is set to 'POST' for POST requests
# 'body' is set to request body for POST requests
}
)
А это мой settings.py
:
SPLASH_URL = 'http://0.0.0.0:8050'
DOWNLOADER_MIDDLEWARES = {
# 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
# 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
# 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
# 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
'scrapy_splash.SplashCookiesMiddleware': 723,
'scrapy_splash.SplashMiddleware': 725,
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
# 'updater.middlewares.SeleniumMiddleware': 700,
}
SPIDER_MIDDLEWARES = {
'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
2 ответа
добавление этого конкретного кода в yield SplashRequest решает мою проблему args={"timeout": 3000}
так :
yield SplashRequest (url, args = {"timeout": 3000}))