Remove HTML tags in a string
How can I remove the HTML tags from the following string?
<P style="MARGIN: 0cm 0cm 10pt" class=MsoNormal><SPAN style="LINE-HEIGHT: 115%;
FONT-FAMILY: 'Verdana','sans-serif'; COLOR: #333333; FONT-SIZE: 9pt">In an
email sent just three days before the Deepwater Horizon exploded, the onshore
<SPAN style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> manager in charge of
the drilling rig warned his supervisor that last-minute procedural changes were
creating "chaos". April emails were given to government investigators by <SPAN
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> and reviewed by The Wall
Street Journal and are the most direct evidence yet that workers on the rig
were unhappy with the numerous changes, and had voiced their concerns to <SPAN
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN>’s operations managers in
Houston. This raises further questions about whether <SPAN
style="mso-bidi-font-weight: bold"><b>BP</b></SPAN> managers properly
considered the consequences of changes they ordered on the rig, an issue
investigators say contributed to the disaster.</SPAN></p><br/>
I'm writing this to Aspose.PDF, but the HTML tags show up in the PDF. How can I remove them?
2 Answers
Accepted answer
Warning: This does not work for all cases and should not be used to process untrusted user input.
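using System.Text.RegularExpressions;
...
// Matches anything that looks like a tag: '<', then lazily up to the next '>'.
const string HTML_TAG_PATTERN = "<.*?>";

static string StripHTML(string inputString)
{
    // Replace every match with an empty string so only the text remains.
    return Regex.Replace(inputString, HTML_TAG_PATTERN, string.Empty);
}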
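To make the warning concrete, here is a minimal, self-contained sketch of the regex approach. The class name `HtmlStripDemo`, the `RegexOptions.Singleline` flag, and the `WebUtility.HtmlDecode` step are additions for illustration and are not part of the answer above. `Singleline` is assumed to be useful because, by default, `.` does not match a newline, so a tag that wraps across lines would otherwise be left in place.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class HtmlStripDemo
{
    // Singleline lets '.' match newlines, so tags that wrap across lines are matched too.
    private static readonly Regex TagRegex = new Regex("<.*?>", RegexOptions.Singleline);

    public static string StripHtml(string input)
    {
        string withoutTags = TagRegex.Replace(input, string.Empty);
        // Decode entities such as &amp; that remain in the text after the tags are gone.
        return WebUtility.HtmlDecode(withoutTags);
    }

    static void Main()
    {
        // Simple markup is handled as expected.
        Console.WriteLine(StripHtml("<p class=MsoNormal>the onshore <b>BP</b> manager</p><br/>"));
        // prints: the onshore BP manager

        // A '>' inside an attribute value ends the match too early and leaves debris.
        Console.WriteLine(StripHtml("<a title=\"a > b\">link</a>"));
        // prints:  b">link
    }
}
```

The second call shows one of the cases the warning refers to: the pattern has no notion of quoted attribute values, so it cannot be relied on for arbitrary or untrusted markup.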
You should use the HTML Agility Pack:
HtmlDocument doc = ...
string text = doc.DocumentNode.InnerText;
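A fuller sketch of the HTML Agility Pack approach might look like the following. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class name `AgilityPackDemo` and the method name `ExtractText` are hypothetical. Note that the root of a parsed document is exposed as `DocumentNode`, and that `InnerText` leaves entities encoded, which is why `HtmlEntity.DeEntitize` is applied here.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

static class AgilityPackDemo
{
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // parse the markup from a string

        // InnerText concatenates all text nodes under the root node.
        string text = doc.DocumentNode.InnerText;

        // Convert entities such as &amp; back to plain characters.
        return HtmlEntity.DeEntitize(text);
    }

    static void Main()
    {
        Console.WriteLine(ExtractText("<p>the onshore <b>BP</b> manager &amp; his supervisor</p>"));
    }
}
```

Because the markup is parsed rather than pattern-matched, this route does not depend on where `>` characters or line breaks happen to fall inside a tag.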