Scrapy with Tor, Polipo in (Ubuntu): как обновить IP
Я добавил следующий код в файл /etc/polipo/config:
# This file only needs to list configuration variables that deviate
# from the default values. See /usr/share/doc/polipo/examples/config.sample
# and "polipo -v" for variables you can tweak and further information.
logSyslog = true
logFile = /var/log/polipo/polipo.log
socksParentProxy = localhost:9050
diskCacheRoot=""
disableLocalInterface=""
Добавил этот код в файл settings.py:
# More comprehensive list can be found at
# http://techpatterns.com/forums/about304.html
USER_AGENT_LIST = [
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10'
]
DOWNLOADER_MIDDLEWARES = {
'myproject.middlewares.RandomUserAgentMiddleware': 400,
'myproject.middlewares.ProxyMiddleware': 410,
'myproject.scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None
# Disable compression middleware, so the actual HTML pages are cached
}
Удалил файл middlewares.py из папки проекта scrapy и добавил файл middlewares.py со следующим кодом:
#TOR proxy
import os
import random
from scrapy.conf import settings
class RandomUserAgentMiddleware(object):
def process_request(self, request, spider):
ua = random.choice(settings.get('USER_AGENT_LIST'))
if ua:
request.headers.setdefault('User-Agent', ua)
class ProxyMiddleware(object):
def process_request(self, request, spider):
request.meta['proxy'] = settings.get('HTTP_PROXY')
Настройте параметры сети браузера TOR следующим образом:
Proxy type: HTTP/HTTPS
Address: localhost
Port: 9050
Запустил полипо с помощью следующей команды:
sudo /etc/init.d/polipo restart
Кажется, он не сканирует сайт. Это мой журнал:
scrapy crawl event -o items.csv
2018-09-13 21:02:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: tentimes)
2018-09-13 21:02:02 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.12 (default, Dec 4 2017, 14:50:18) - [GCC 5.4.0 20160609], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Linux-4.15.0-33-generic-x86_64-with-Ubuntu-16.04-xenial
2018-09-13 21:02:02 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tentimes.spiders', 'FEED_URI': 'items.csv', 'SPIDER_MODULES': ['tentimes.spiders'], 'BOT_NAME': 'tentimes', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'csv'}
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'tentimes.middlewares.RandomUserAgentMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'tentimes.middlewares.ProxyMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-09-13 21:02:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-13 21:02:02 [scrapy.core.engine] INFO: Spider opened
2018-09-13 21:02:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-09-13 21:02:02 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-09-13 21:03:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)