Определить язык строки

Question

Определить язык строки

Мне нужно определить язык строки, прочитанной из файла PDF, текст в основном на английском языке, но "NLLanguageRecognizer" возвращает, что это "румынский"

функция, которую я использую:

 class func detectedLangaugeFormat(for string: String) -> String {
       if #available(iOS 12.0, *) {
           let recognizer = NLLanguageRecognizer()
           recognizer.processString(string)
        guard let languageCode = recognizer.dominantLanguage?.rawValue else { return "rtl" }
           let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
           print("lan")
           let currentLocale = NSLocale.current as NSLocale
           let direction: NSLocale.LanguageDirection = NSLocale.characterDirection(forLanguage: languageCode)
            if direction == .rightToLeft {
                return "rtl"
            }else if direction == .leftToRight {
                return "ltr"
            }
       } else {
           // Fallback on earlier versions
       }


    return "rtl"
   }

и строка, переданная этому методу:

"\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "

2

ios swift language-recognition

Источник

user3405821 24 дек '19 в 13:19

3 ответа

Другие вопросы по тегам ios swift language-recognition

user10715519 24 дек '19 в 14:07 2019-12-24 14:07 · Answer 1 · 2019-12-24 14:07

Одним из возможных решений может быть удаление более одного пробела в строке.

let regex = try? NSRegularExpression(pattern: "  +", options: .caseInsensitive)
    str = regex?.stringByReplacingMatches(in: str, options: [], range: NSRange(location: 0, length: str.count), withTemplate: " ") ?? ""

Я пробовал вашу строку с этим регулярным выражением, и она сработала. Распознаватель языка вернул код языка en lang.

user5501940 24 дек '19 в 14:07 2019-12-24 14:07 · Answer 2 · 2019-12-24 14:07

По какой-то причине пробелы и символы новой строки являются результатом processString(_:) быть неэффективным. Что вам нужно сделать, так это избавиться от них перед передачей строки вашему методу:

let givenString = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
let trimmedString = givenString.trimmingCharacters(in: .whitespacesAndNewlines)

let result = detectedLangaugeFormat(for: trimmedString)
print(result) // ltr

На этом этапе он должен быть узнаваемым как английский (если вы напечатаете detectedLangauge внутри вашего метода вместо "lan" вы найдете его "English").

let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print(detectedLangauge) // Optional("English")

user9204192 24 дек '19 в 16:47 2019-12-24 16:47 · Answer 3 · 2019-12-24 16:47

Удалите неалфавитные символы [WhiteSpaces,!,@,# И т.д.] в строке, а затем попробуйте определить язык.

extension String{
    func findFirstAlphabetic() -> String.Index?{
        for index  in self.indices{
            if String(self[index]).isAlphanumeric == true{
                return index
            }
        }
        return nil
    }
    var isAlphanumeric: Bool {
        return !isEmpty && range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
    }
    func alphabetic_Leading_SubString() -> String?{
        if let startIndex =  self.findFirstAlphabetic(){
            let newSubString = self[startIndex..<self.endIndex]
            return String(newSubString)
        }
        return nil
    }
}

Применение:-

let string = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
detectedLangaugeFormat(for: string.alphabetic_Leading_SubString()!)