How to use Splash (JS Rendering Service) with a proxy
In Scrapy this is set up automatically, but not with curl or a plain HTTP request.
With curl, we can call Splash without any proxy.
How do I do the same through a proxy?
I tried this:
http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=10&wait=0.5 --proxy myproxy:port
But I got:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Lightspeed Systems - Web Access</title>
<div id="titles">
<h2>Unable to complete URL request</h2>
<div id="content">
<p>An error has occurred while trying to access <a href="http://<server_ip>:8050/render.html?">http://<server_ip>:8050/render.html?</a>.</p>
<blockquote id="error">
<p><b>Access denied.</b></p>
<p>Security permissions are not allowing the request attempt. Please contact your service provider if you feel this is incorrect.</p>
<div id="footer">
C:\Users\Dr. Printer>curl "http://<server_ip>:8050/render.html?url=http://www.example.com/?timeout=30&wait=0.5"
{"description": "Timeout exceeded rendering page", "type": "GlobalTimeoutError", "info": {"timeout": 30.0}, "error": 504}
1 Answer
If you want to use Crawlera as the proxy, you can do it with this Lua script:
function use_crawlera(splash)
    -- Make sure you pass your Crawlera API key in the 'crawlera_user' arg.
    -- Have a look at the file spiders/quotes-js.py to see how to do it.
    -- Find your Crawlera credentials in https://app.scrapinghub.com/
    local user = splash.args.crawlera_user
    local host = 'proxy.crawlera.com'
    local port = 8010
    local session_header = 'X-Crawlera-Session'
    local session_id = 'create'

    splash:on_request(function (request)
        -- The filters below can be used to speed up the crawling
        -- process. They drop or bypass requests to undesired domains and
        -- useless resources. Remove the ones that do not fit your use case
        -- and add your own rules.

        -- Discard requests to advertising and tracking domains.
        if string.find(request.url, 'doubleclick%.net') or
           string.find(request.url, 'analytics%.google%.com') then
            request:abort()
            return
        end

        -- Avoid using Crawlera for subresources fetching to increase crawling
        -- speed. The example below avoids using Crawlera for URLs containing
        -- '://static.' and the ones ending with '.png'.
        if string.find(request.url, '://static%.') ~= nil or
           string.find(request.url, '%.png$') ~= nil then
            return
        end

        request:set_header('X-Crawlera-Cookies', 'disable')
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=''}
    end)

    splash:on_response_headers(function (response)
        -- Reuse the session ID returned by Crawlera for later requests.
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)
end

function main(splash)
    -- Assumed steps: install the proxy hooks, then load the page given in
    -- the 'url' argument before collecting the results.
    use_crawlera(splash)
    assert(splash:go(splash.args.url))
    return {
        html = splash:html(),
        cookies = splash:get_cookies(),
    }
end
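Outside Scrapy, a script like the one above can be sent to Splash's /execute endpoint. The following is a minimal Python sketch, not a definitive client. The helper names build_execute_request and fetch are hypothetical, LUA_SOURCE is a stand-in for the full script, and the timeout values are assumptions that must stay within your Splash instance's configured maximum. Splash exposes extra request arguments through splash.args, which is how the script reads crawlera_user and url.

```python
import json
import urllib.request

# Stand-in only: paste the full Lua script from the answer here.
LUA_SOURCE = """
function main(splash)
    assert(splash:go(splash.args.url))
    return splash:html()
end
"""


def build_execute_request(splash_base, lua_source, target_url, crawlera_user):
    """Build a POST request for Splash's /execute endpoint.

    Keys other than 'lua_source' become fields of splash.args in the script.
    """
    payload = {
        "lua_source": lua_source,
        "url": target_url,
        "crawlera_user": crawlera_user,
        "timeout": 60,  # assumed value; proxied page loads are often slow
    }
    return urllib.request.Request(
        splash_base.rstrip("/") + "/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch(request, timeout=90):
    """Send the request to Splash and return the decoded response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, fetch(build_execute_request("http://<server_ip>:8050", LUA_SOURCE, "http://www.example.com/", "<your API key>")) would return the script's result as text once a real server address and key are filled in.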
Don't forget to install scrapy-crawlera and enable it in your settings. For more information, see https://support.scrapinghub.com/support/solutions/articles/22000188428-using-crawlera-with-splash-scrapy
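For the plain-proxy case from the question, two details seem worth checking. First, curl's --proxy option routes the request to Splash itself through the proxy, which matches the "Access denied" page returned for http://<server_ip>:8050. Second, in url=http://www.example.com/?timeout=10&wait=0.5 the ?timeout=10 part becomes part of the target URL instead of a Splash argument. As far as I know, the Splash HTTP API accepts a proxy argument (a proxy profile name or a proxy URL), so Splash can use the proxy while you talk to Splash directly. Below is a hedged Python sketch that builds such a URL with proper encoding; build_render_url is a hypothetical helper and http://myproxy:8080 is a placeholder address.

```python
from urllib.parse import urlencode


def build_render_url(splash_base, target_url, proxy=None, timeout=30, wait=0.5):
    """Build a render.html URL with every argument percent-encoded.

    'proxy' is passed to Splash as a query argument, so Splash (not the
    HTTP client) connects to the target site through that proxy.
    """
    params = {"url": target_url, "timeout": timeout, "wait": wait}
    if proxy:
        params["proxy"] = proxy
    return splash_base.rstrip("/") + "/render.html?" + urlencode(params)


# Placeholder host and proxy; replace with your own values.
example_url = build_render_url(
    "http://localhost:8050",
    "http://www.example.com/",
    proxy="http://myproxy:8080",
)
```

The resulting URL can be passed to curl in double quotes, without --proxy, so the shell does not split it at the & characters.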