Missing fields when parsing XML with Python
I'm trying to collect all the data about a house using the Zillow API. Some fields come back populated, while others come back as None.
Here is my Python code:
from bs4 import BeautifulSoup
import requests
import urllib, urllib2
import csv
url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text
soup = BeautifulSoup(pageText)
useCode = soup.find('useCode')
taxAssessmentYear = soup.find('taxAssessmentYear')
taxAssessment = soup.find('taxAssessment')
yearBuilt = soup.find('yearBuilt')
lotSizeSqFt = soup.find('lotSizeSqFt')
finishedSqFt = soup.find('finishedSqFt')
bathrooms = soup.find('bathrooms')
lastSoldDate = soup.find('lastSoldDate')
lastSoldPrice = soup.find('lastSoldPrice')
zestimate = soup.find('zestimate')
amount = soup.find('amount')
lastupdated = soup.find('last-updated')
valueChangeduration = soup.find('valueChange')
valuationRange = soup.find('valuationRange')
lowcurrency = soup.find('low')
highcurrency = soup.find('high')
percentile = soup.find('percentile')
localRealEstate = soup.find('localRealEstate')
region = soup.find('region')
links = soup.find('links')
overview = soup.find('overview')
forSaleByOwner = soup.find('forSaleByOwner')
forSale = soup.find('forSale')
array = [
['useCode ' , useCode],
['taxAssessmentYear ' , taxAssessmentYear],
['taxAssessment ' , taxAssessment],
['yearBuilt ' , yearBuilt],
['lotSizeSqFt ' , lotSizeSqFt],
['finishedSqFt ' , finishedSqFt],
['bathrooms ' , bathrooms],
['lastSoldDate ' , lastSoldDate],
['lastSoldPrice ' , lastSoldPrice],
['zestimate ' , zestimate],
['amount ' , amount],
['lastupdated ' , lastupdated],
['valueChangeduration ' , valueChangeduration],
['valuationRange ' , valuationRange],
['lowcurrency ' , lowcurrency],
['highcurrency ' , highcurrency],
['percentile ' , percentile],
['localRealEstate ' , localRealEstate],
['region ' , region],
['links ' , links],
['overview ' , overview],
['forSaleByOwner ' , forSaleByOwner],
['forSale ' , forSale]]
for x in array:
    print x
The results I get have many missing values, as shown below:
['useCode ', None]
['taxAssessmentYear ', None]
['taxAssessment ', None]
['yearBuilt ', None]
['lotSizeSqFt ', None]
['finishedSqFt ', None]
['bathrooms ', <bathrooms>2.0</bathrooms>]
['lastSoldDate ', None]
['lastSoldPrice ', None]
['zestimate ', <zestimate>
<amount currency="USD">977262</amount>
<last-updated>01/23/2014</last-updated>
<oneweekchange deprecated="true">
<valuechange currency="USD" duration="30">-25723</valuechange>
<valuationrange>
<low currency="USD">928399</low>
<high currency="USD">1055443</high>
</valuationrange>
<percentile>0</percentile>
</oneweekchange></zestimate>]
['amount ', <amount currency="USD">977262</amount>]
['lastupdated ', <last-updated>01/23/2014</last-updated>]
['valueChangeduration ', None]
['valuationRange ', None]
['lowcurrency ', <low currency="USD">928399</low>]
['highcurrency ', <high currency="USD">1055443</high>]
['percentile ', <percentile>0</percentile>]
['localRealEstate ', None]
['region ', <region id="46465" name="Mc Lean" type="city">
<links>
<overview>
http://www.zillow.com/local-info/VA-Mc-Lean/r_46465/
</overview>
<forsalebyowner>http://www.zillow.com/mc-lean-va/fsbo/</forsalebyowner>
<forsale>http://www.zillow.com/mc-lean-va/</forsale>
</links>
</region>]
['links ', <links>
<homedetails>
http://www.zillow.com/homedetails/6870-Churchill-Rd-Mc-Lean-VA-22101/51751742_zpid/
</homedetails>
<graphsanddata>
http://www.zillow.com/homedetails/6870-Churchill-Rd-Mc-Lean-VA-22101/51751742_zpid/#charts-and-data
</graphsanddata>
<mapthishome>http://www.zillow.com/homes/51751742_zpid/</mapthishome>
<comparables>http://www.zillow.com/homes/comps/51751742_zpid/</comparables>
</links>]
['overview ', <overview>
http://www.zillow.com/local-info/VA-Mc-Lean/r_46465/
</overview>]
['forSaleByOwner ', None]
['forSale ', None]
[Finished in 0.6s]
Any ideas about what is causing this?
2 Answers
By default, BeautifulSoup parses the page as HTML and converts all tag names to lowercase. You can see this in the output above: the region tag contains forsalebyowner and forsale, whereas in the source data they are forSaleByOwner and forSale. A search for a camelCase name such as forSaleByOwner therefore matches nothing and returns None.
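The lowercasing can be reproduced on a small inline sample. This is only a sketch: the `sample` string is a hypothetical stand-in for the Zillow response (its two values are taken from the output above), and the parser is named explicitly as "html.parser" so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the downloaded page.
sample = "<result><useCode>SingleFamily</useCode><bathrooms>2.0</bathrooms></result>"

# An HTML parser normalizes tag names to lowercase while building the tree.
soup = BeautifulSoup(sample, "html.parser")

camel = soup.find("useCode")   # no match: the tree only contains <usecode>
lower = soup.find("usecode")   # match

print(camel)        # None
print(lower.name)   # usecode
print(lower.text)   # SingleFamily
```

Names that are already lowercase in the source, such as bathrooms, are unaffected, which is why only some fields in the question came back as None.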
Fortunately, you can override this behavior by specifying that you're parsing XML when you create the BeautifulSoup object. However, you'll first need to trim off the non-XML parts of the page:
url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
pageText = url.text.split('\n')
# exclude initial text & end comment
pageXML = ''.join( pageText[1:pageText.index(u'<!--')] )
soup = BeautifulSoup(pageXML, "xml")
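If installing lxml (which BeautifulSoup's "xml" mode relies on) is not an option, a case-preserving lookup can also be sketched with the standard library's xml.etree.ElementTree. This is a different parser from the one used above, not part of BeautifulSoup. The `sample` document and the `field_text` helper are hypothetical and only illustrate the idea; a real response would still need its non-XML text trimmed first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; values are taken from the output in the question.
sample = """<response>
  <results>
    <result>
      <useCode>SingleFamily</useCode>
      <bathrooms>2.0</bathrooms>
      <zestimate>
        <amount currency="USD">977262</amount>
      </zestimate>
    </result>
  </results>
</response>"""

root = ET.fromstring(sample)


def field_text(root, tag):
    """Return the stripped text of the first descendant named `tag`, or None."""
    node = root.find(".//" + tag)
    if node is None or node.text is None:
        return None
    return node.text.strip()


use_code = field_text(root, "useCode")    # camelCase name is preserved
missing = field_text(root, "usecode")     # XML names are case-sensitive
amount = root.find(".//amount")
currency = amount.get("currency")         # attributes are read with .get()

print(use_code, missing, amount.text, currency)
```

Because an XML parser keeps tag names exactly as written, the camelCase names from the question work unchanged, and the lowercase spelling is the one that fails.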
With BeautifulSoup's default parsing, search using lowercase tag names:
>>> url = requests.get("https://raw.github.com/rfarley90/random/master/zillowresults.html")
>>> pageText = url.text
>>> soup = BeautifulSoup(pageText)
>>> soup.find('usecode')
<usecode>SingleFamily</usecode>
>>> soup.find('usecode').text
u'SingleFamily'
or:
>>> soup.response.results.result.usecode
<usecode>SingleFamily</usecode>
>>> soup.response.results.result.usecode.text
u'SingleFamily'
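To keep the camelCase names from the question while still searching in lowercase, the lookup can be wrapped in a small helper. This is a rough sketch under assumptions: `find_text` is a hypothetical helper, the `sample` string stands in for the downloaded page, and "html.parser" is chosen explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample standing in for the downloaded page.
sample = "<result><useCode>SingleFamily</useCode><bathrooms>2.0</bathrooms></result>"
soup = BeautifulSoup(sample, "html.parser")


def find_text(soup, name):
    """Look up `name` in lowercase and return its text, or None if absent."""
    tag = soup.find(name.lower())
    return tag.get_text(strip=True) if tag is not None else None


# Field names as written in the question; lastSoldDate is absent from the sample.
fields = ["useCode", "bathrooms", "lastSoldDate"]
values = {name: find_text(soup, name) for name in fields}

for name in fields:
    print(name, values[name])
```

A dictionary keyed by the original names also replaces the long list of individual variables, so adding a field means adding one string to `fields`.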