Sentence segmentation using Regex

Question

I have a handful of text (SMS) messages that I want to segment using the period ('.') as a delimiter. I can't handle the following kinds of messages. How can I segment them using Regex in Python?

Before segmentation:

'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
'no of beds 8.please inform person in-charge.tq'

After segmentation:

'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u'
'no of beds 8' 'please inform person in-charge' 'tq'

Each line is a separate message.

Updated:

I am doing natural language processing, and I feel it is acceptable to treat '16.8mmol/l' and 'no of beds 8.2 cups of tea.' the same way. 80% accuracy is enough for me, but I want to reduce false positives as much as possible.

2

python regex text-segmentation

user585329 Jul 19 '11 at 10:17

5 Answers

Solution

A few weeks ago I searched for a regex that would catch every substring representing a number in a string, whatever form the number is written in, including numbers in scientific notation and Indian numbers with commas: see this thread

I use that regex in the following code to give a solution to your problem.

Unlike the other answers, in my solution the dot in "8." is not treated as a dot on which a split must be performed, because it can be read as a float that has no digit after the dot.
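As a side note that is not part of the original answer, Python's own float() accepts this form, which supports reading "8." as a number. A minimal check (the variable names are only illustrative):

```python
# float() accepts a dot with no digits after it, and a dot with no digits before it.
trailing = float('8.')
leading = float('.977')

print(trailing, leading)
```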

import re

regx = re.compile(r'(?<![\d.])(?!\.\.)'
                  r'(?<![\d.][eE][+-])(?<![\d.][eE])(?<!\d[.,])'
                  '' #---------------------------------
                  r'([+-]?)'
                  r'(?![\d,]*?\.[\d,]*?\.[\d,]*?)'
                  r'(?:0|,(?=0)|(?<!\d),)*'
                  r'(?:'
                  r'((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|\.(0)'
                  r'|((?<!\.)\.\d+?)'
                  r'|([\d,]+\.\d+?))'
                  r'0*'
                  '' #---------------------------------
                  r'(?:'
                  r'([eE][+-]?)(?:0|,(?=0))*'
                  r'(?:'
                  r'(?!0+(?=\D|\Z))((?:\d(?!\.[1-9])|,(?=\d))+)[.,]?'
                  r'|((?<!\.)\.(?!0+(?=\D|\Z))\d+?)'
                  r'|([\d,]+\.(?!0+(?=\D|\Z))\d+?))'
                  r'0*'
                  r')?'
                  '' #---------------------------------
                  r'(?![.,]?\d)')



simpler_regex = re.compile(r'(?<![\d.])0*(?:'
                           r'(\d+)\.?|\.(0)'
                           r'|(\.\d+?)|(\d+\.\d+?)'
                           r')0*(?![\d.])')


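def split_outnumb(string, regx=regx, a=0):
    # Positions of the dots that lie inside a number matched by regx;
    # no split is performed on these.
    excluded_pos = set(x for mat in regx.finditer(string)
                       for x in range(*mat.span()) if string[x] == '.')
    li = []
    # Split on every remaining dot; a is the start of the current segment.
    for xdot in (x for x, c in enumerate(string) if c == '.' and x not in excluded_pos):
        li.append(string[a:xdot])
        a = xdot + 1
    li.append(string[a:])
    return li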





for sentence in ('hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u',
                 'no of beds 8.please inform person in-charge.tq',
                 'no of beds 8.2 cups of tea.tarabada',
                 'this number .977 is a float',
                 'numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation',
                 'an indian number 12,45,782.258 in this.sentence and 45,78,325. is another',
                 'no dot in this sentence',
                 ''):
    print('sentence         =', sentence)
    print('splitted eyquem  =', split_outnumb(sentence))
    print('splitted eyqu 2  =', split_outnumb(sentence, regx=simpler_regex))
    print('splitted gurney  =', re.split(r"\.(?!\d)", sentence))
    print('splitted stema   =', re.split(r'(?<!\d)\.|\.(?!\d)', sentence))
    print()

Result

sentence         = hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u
splitted eyquem  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted eyqu 2  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted gurney  = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
splitted stema   = ['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']

sentence         = no of beds 8.please inform person in-charge.tq
splitted eyquem  = ['no of beds 8.please inform person in-charge', 'tq']
splitted eyqu 2  = ['no of beds 8.please inform person in-charge', 'tq']
splitted gurney  = ['no of beds 8', 'please inform person in-charge', 'tq']
splitted stema   = ['no of beds 8', 'please inform person in-charge', 'tq']

sentence         = no of beds 8.2 cups of tea.tarabada
splitted eyquem  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted eyqu 2  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted gurney  = ['no of beds 8.2 cups of tea', 'tarabada']
splitted stema   = ['no of beds 8.2 cups of tea', 'tarabada']

sentence         = this number .977 is a float
splitted eyquem  = ['this number .977 is a float']
splitted eyqu 2  = ['this number .977 is a float']
splitted gurney  = ['this number .977 is a float']
splitted stema   = ['this number ', '977 is a float']

sentence         = numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific.notation
splitted eyquem  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted eyqu 2  = ['numbers 214.21E+45 , 478945.E-201 and .12478E+02 are in scientific', 'notation']
splitted gurney  = ['numbers 214.21E+45 , 478945', 'E-201 and .12478E+02 are in scientific', 'notation']
splitted stema   = ['numbers 214.21E+45 , 478945', 'E-201 and ', '12478E+02 are in scientific', 'notation']

sentence         = an indian number 12,45,782.258 in this.sentence and 45,78,325. is another
splitted eyquem  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted eyqu 2  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325. is another']
splitted gurney  = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']
splitted stema   = ['an indian number 12,45,782.258 in this', 'sentence and 45,78,325', ' is another']

sentence         = no dot in this sentence
splitted eyquem  = ['no dot in this sentence']
splitted eyqu 2  = ['no dot in this sentence']
splitted gurney  = ['no dot in this sentence']
splitted stema   = ['no dot in this sentence']

sentence         = 
splitted eyquem  = ['']
splitted eyqu 2  = ['']
splitted gurney  = ['']
splitted stema   = ['']

EDIT 1

I added simpler_regex, which detects numbers, taken from my post in this thread.

It doesn't detect Indian numbers or numbers in scientific notation, but it actually gives the same results on the sentences above.

5

user551449 Jul 20 '11 at 00:15

You can use a negative lookahead assertion to match a "." that is not followed by a digit, and use re.split on that:

>>> import re
>>> splitter = r"\.(?!\d)"
>>> s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'
>>> re.split(splitter, s)
['hyper count 16.8mmol/l', 'plz review b4 5pm', 'just to inform u', 'thank u']
>>> s = 'no of beds 8.please inform person in-charge.tq'
>>> re.split(splitter, s)
['no of beds 8', 'please inform person in-charge', 'tq']
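If the messages also contain a space after a period or a trailing period, the same splitter leaves leading whitespace and an empty last piece. A minimal sketch of a wrapper that tidies this up, assuming that stripping whitespace and dropping empty segments is acceptable; the name segment is hypothetical and not from the original answer:

```python
import re

# Same pattern as above: a dot that is not followed by a digit.
SPLITTER = re.compile(r"\.(?!\d)")

def segment(message):
    """Split on the pattern, strip each piece and drop empty pieces."""
    return [part.strip() for part in SPLITTER.split(message) if part.strip()]

segments = segment('no of beds 8. please inform person in-charge.tq.')
print(segments)
```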

2

user281368 Jul 19 '11 at 10:42

It depends on what exactly your sentences look like, but you could try:

.*?[a-zA-Z0-9]\.(?!\d)

See if that works. It will keep the quotes, but you can remove them afterwards if needed.
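The answer does not say how to apply the pattern. One possible reading, sketched here as an assumption, is to collect the matches with re.finditer. Each match ends with its terminating dot, and a final piece with no closing dot is not matched at all, so the sketch strips the dot and appends the leftover tail by hand:

```python
import re

pattern = re.compile(r".*?[a-zA-Z0-9]\.(?!\d)")
s = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'

segments = []
end = 0
for m in pattern.finditer(s):
    segments.append(m.group()[:-1])  # drop the terminating dot
    end = m.end()

# Text after the last matched dot ('thank u' here) is not matched by the pattern.
tail = s[end:]
if tail:
    segments.append(tail)

print(segments)
```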

0

user78944 Jul 19 '11 at 10:32

"...".split(".")

split is a built-in Python string method that splits a string on a given separator.
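For comparison with the question's requirement, a quick illustration (not part of the original answer) of what plain str.split does to the first sample message. The decimal number is cut in two:

```python
message = 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u'

# str.split splits on every dot, including the one inside 16.8
parts = message.split('.')
print(parts)
```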

-1

user398968 Jul 19 '11 at 10:21

user626273 Jul 19 '11 at 10:43 · Accepted Answer

How about

re.split('(?<!\d)\.|\.(?!\d)', 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u')

The lookaround assertions ensure that there is no digit on at least one side of the dot. So this also covers the 16.8 case: the expression will not split if there are digits on both sides.
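A short sketch of how this behaves on two of the other test sentences from this page (the variable names are only illustrative). A dot between two digits is kept, but a float written with a leading dot, such as .977, is still split:

```python
import re

splitter = re.compile(r'(?<!\d)\.|\.(?!\d)')

# Digits on both sides of the dot in 8.2: no split there.
both_sides = splitter.split('no of beds 8.2 cups of tea.tarabada')

# No digit before the dot in .977: the expression splits there.
leading_dot = splitter.split('this number .977 is a float')

print(both_sides)
print(leading_dot)
```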