Определить язык строки
Мне нужно определить язык строки, прочитанной из файла PDF, текст в основном на английском языке, но "NLLanguageRecognizer" возвращает, что это "румынский"
функция, которую я использую:
class func detectedLangaugeFormat(for string: String) -> String {
if #available(iOS 12.0, *) {
let recognizer = NLLanguageRecognizer()
recognizer.processString(string)
guard let languageCode = recognizer.dominantLanguage?.rawValue else { return "rtl" }
let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print("lan")
let currentLocale = NSLocale.current as NSLocale
let direction: NSLocale.LanguageDirection = NSLocale.characterDirection(forLanguage: languageCode)
if direction == .rightToLeft {
return "rtl"
}else if direction == .leftToRight {
return "ltr"
}
} else {
// Fallback on earlier versions
}
return "rtl"
}
и строка, переданная этому методу:
"\r\n A Simple PDF File \r\n This is a small demonstration .pdf file - \r\n just for use in the Virtual Mechanics tutorials. More text. And more \r\n text. And more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. Boring, zzzzz. And more text. And more text. And \r\n more text. And more text. And more text. And more text. And more text. \r\n And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. And more text. Even more. Continued on page 2 ...\r\n Simple PDF File 2 \r\n ...continued from page 1. Yet more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. Oh, how boring typing this stuff. But not as boring as watching \r\n paint dry. And more text. And more text. And more text. And more text. \r\n Boring. More, a little more text. The end, and just as well. "
3 ответа
Одним из возможных решений может быть удаление более одного пробела в строке.
let regex = try? NSRegularExpression(pattern: " +", options: .caseInsensitive)
str = regex?.stringByReplacingMatches(in: str, options: [], range: NSRange(location: 0, length: str.count), withTemplate: " ") ?? ""
Я пробовал вашу строку с этим регулярным выражением, и она сработала. Распознаватель языка вернул код языка en lang.
По какой-то причине пробелы и символы новой строки являются результатом processString(_:)
быть неэффективным. Что вам нужно сделать, так это избавиться от них перед передачей строки вашему методу:
let givenString = "\r\n A Simple PDF File \r\n This is a small demonstration .pdf file - \r\n just for use in the Virtual Mechanics tutorials. More text. And more \r\n text. And more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. Boring, zzzzz. And more text. And more text. And \r\n more text. And more text. And more text. And more text. And more text. \r\n And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. And more text. Even more. Continued on page 2 ...\r\n Simple PDF File 2 \r\n ...continued from page 1. Yet more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. Oh, how boring typing this stuff. But not as boring as watching \r\n paint dry. And more text. And more text. And more text. And more text. \r\n Boring. More, a little more text. The end, and just as well. "
let trimmedString = givenString.trimmingCharacters(in: .whitespacesAndNewlines)
let result = detectedLangaugeFormat(for: trimmedString)
print(result) // ltr
На этом этапе он должен быть узнаваемым как английский (если вы напечатаете detectedLangauge
внутри вашего метода вместо "lan" вы найдете его "English").
let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print(detectedLangauge) // Optional("English")
Удалите неалфавитные символы [WhiteSpaces,!,@,# И т.д.] в строке, а затем попробуйте определить язык.
extension String{
func findFirstAlphabetic() -> String.Index?{
for index in self.indices{
if String(self[index]).isAlphanumeric == true{
return index
}
}
return nil
}
var isAlphanumeric: Bool {
return !isEmpty && range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
}
func alphabetic_Leading_SubString() -> String?{
if let startIndex = self.findFirstAlphabetic(){
let newSubString = self[startIndex..<self.endIndex]
return String(newSubString)
}
return nil
}
}
Применение:-
let string = "\r\n A Simple PDF File \r\n This is a small demonstration .pdf file - \r\n just for use in the Virtual Mechanics tutorials. More text. And more \r\n text. And more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. Boring, zzzzz. And more text. And more text. And \r\n more text. And more text. And more text. And more text. And more text. \r\n And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. And more text. And more text. Even more. Continued on page 2 ...\r\n Simple PDF File 2 \r\n ...continued from page 1. Yet more text. And more text. And more text. \r\n And more text. And more text. And more text. And more text. And more \r\n text. Oh, how boring typing this stuff. But not as boring as watching \r\n paint dry. And more text. And more text. And more text. And more text. \r\n Boring. More, a little more text. The end, and just as well. "
detectedLangaugeFormat(for: string.alphabetic_Leading_SubString()!)