How to get the hostname and path separately from an href in Python

Question

I have Python code from which I want to extract the hostname and the path separately. For example, for www.stackru.com/questions/ask I want this result: "hostname: www.stackru.com and path: /questions/ask".

Here is my Python code:

import urllib
from bs4 import BeautifulSoup
import urlparse
import mechanize
import socket
import errno
import io
from nyt4 import articalText

url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False) 
htmltext = br.open(url)
#htmltext = urllib.urlopen(url).read()
soup = BeautifulSoup(htmltext)
maindiv = soup.findAll('section', attrs={'class':'health-collection collection'})
for links in  maindiv:
    atags = soup.findAll('a',href=True)
    for link in atags:
        alinks= link.get('href')
        print alinks.hostname
        print alinks.path

But this code gives me this error:

Traceback (most recent call last):
  File "<pyshell#18>", line 1, in <module>
    execfile("nytimes/test2.py")
  File "nytimes/test2.py", line 21, in <module>
    print alinks.hostname
AttributeError: 'unicode' object has no attribute 'hostname'

html python-2.7 beautifulsoup www-mechanize

user6813536 Oct 9 '16 at 08:16

1 answer

Solution

user2141635 Oct 9 '16 at 10:24 · Accepted Answer

alinks = link.get('href') sets alinks to a string, and a string has no hostname or path attribute. You can use urlparse to split the href and get the path and the hostname:

import mechanize
from bs4 import BeautifulSoup
from urlparse import urlparse  # Python 2 module; in Python 3 it is urllib.parse

url = "http://www.nytimes.com/section/health"
br = mechanize.Browser()
br.set_handle_equiv(False)
htmltext = br.open(url)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(htmltext, "html.parser")
maindiv = soup.find_all('section', attrs={'class': 'health-collection collection'})
for section in maindiv:
    # Search inside the matched section rather than the whole document
    for link in section.find_all('a', href=True):
        # urlparse turns the href string into an object with named parts
        alinks = urlparse(link.get('href'))
        print alinks.hostname
        print alinks.path
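
One caveat, sketched below: urlparse only fills in hostname when the string contains a scheme (or starts with //). A bare string such as www.stackru.com/questions/ask from the question, or a relative href such as /section/science, comes back with hostname set to None. A possible way around this, assuming relative links should be resolved against the URL of the page they came from, is to pass each href through urljoin first. The helper name split_href and the sample hrefs are made up for illustration, and the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urllib.parse import urlparse, urljoin  # Python 3
except ImportError:
    from urlparse import urlparse, urljoin      # Python 2, as in the answer

def split_href(href, base_url):
    """Return (hostname, path) for href, resolving it against base_url.

    split_href is a hypothetical helper, not part of any library.
    """
    absolute = urljoin(base_url, href)  # leaves absolute URLs unchanged
    parts = urlparse(absolute)
    return parts.hostname, parts.path

base = "http://www.nytimes.com/section/health"

# An absolute href keeps its own host.
print(split_href("http://www.stackru.com/questions/ask", base))

# A relative href inherits the host of the page it was found on.
print(split_href("/section/science", base))

# Without a scheme, urlparse treats the whole string as a path.
print(urlparse("www.stackru.com/questions/ask").hostname)
```

Whether resolving against the page URL is the right choice depends on what the links are used for; if only external hosts matter, skipping hrefs whose hostname is None may be enough.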