R - Extracting the city name from two-word strings
I am working with a database that contains city names, for example:
cities <- c("Fairhope 3NE", "Gadsden 19N", "Selma 13 WNW", "Batesville 8 WNW",
"Elgin 5 S", "Tucson 11 W", "Williams 35 NNW", "Fallbrook 5 NE",
"Stovepipe Wells 1 SW", "Cortez 8 SE", "La Junta 17 WSW", "Montrose 11 ENE",
"Everglades City 5 NE", "Sebring 23 SSE", "Brunswick 23 S", "Newton 11 SW",
"Newton 8 W", "Watkinsville 5 SSE", "Des Moines 17 E", "Champaign 9 SW",
"Shabbona 5 NNE", "Bedford 5 WNW", "Manhattan 6 SSW", "Oakley 19 SSW",
"Bowling Green 21 NNE", "Versailles 3 NNW", "Lafayette 13 SE",
"Monroe 26 N", "Goodridge 12 NNW", "Chillicothe 22 ENE", "Joplin 24 N",
"Salem 10 W", "Holly Springs 4 N", "Newton 5 ENE", "Asheville 13 S",
"Asheville 8 SSW", "Durham 11 W", "Jamestown 38 WSW", "Medora 7 E",
"Northgate 5 ESE", "Harrison 20 SSE", "Lincoln 11 SW", "Lincoln 8 ENE",
"Whitman 5 ENE", "Las Cruces 20 N", "Los Alamos 13 W", "Socorro 20 N",
"Mercury 3 SSW", "Coshocton 8 NNE", "Goodwell 2 E", "Stillwater 2 W",
"Stillwater 5 WNW", "Coos Bay 8 SW", "Corvallis 10 SSW", "Riley 10 WSW",
"Blackville 3 W", "McClellanville 7 NE", "Aberdeen 35 WNW", "Buffalo 13 ESE",
"Pierre 24 S", "Sioux Falls 14 NNE", "Crossville 7 NW", "Austin 33 NW",
"Bronte 11 NNE", "Edinburg 17 NNE", "Monahans 6 ENE", "Muleshoe 19 S",
"Palestine 6 WNW", "Panther Junction 2 N", "Necedah 5 WNW")
I want to extract only the city name. The following code works in some cases:
gsub( " .*$", "", cities)
but it fails for cities with two-word names, such as Stovepipe Wells 1 SW and La Junta 17 WSW: everything after the first space is removed, so only the first word survives.
Any ideas for handling these cases?
5 Answers
Assuming every string ends with the same pattern:
- a space
- one or more digits
- an optional space
- a combination of the letters E, N, S, W
the following call removes that suffix:
gsub(" \\d+ ?[ENSW]+$", "", cities)
Result:
[1] "Fairhope" "Gadsden" "Selma" "Batesville"
[5] "Elgin" "Tucson" "Williams" "Fallbrook"
[9] "Stovepipe Wells" "Cortez" "La Junta" "Montrose"
[13] "Everglades City" "Sebring" "Brunswick" "Newton"
[17] "Newton" "Watkinsville" "Des Moines" "Champaign"
[21] "Shabbona" "Bedford" "Manhattan" "Oakley"
[25] "Bowling Green" "Versailles" "Lafayette" "Monroe"
[29] "Goodridge" "Chillicothe" "Joplin" "Salem"
[33] "Holly Springs" "Newton" "Asheville" "Asheville"
[37] "Durham" "Jamestown" "Medora" "Northgate"
[41] "Harrison" "Lincoln" "Lincoln" "Whitman"
[45] "Las Cruces" "Los Alamos" "Socorro" "Mercury"
[49] "Coshocton" "Goodwell" "Stillwater" "Stillwater"
[53] "Coos Bay" "Corvallis" "Riley" "Blackville"
[57] "McClellanville" "Aberdeen" "Buffalo" "Pierre"
[61] "Sioux Falls" "Crossville" "Austin" "Bronte"
[65] "Edinburg" "Monahans" "Muleshoe" "Palestine"
[69] "Panther Junction" "Necedah"
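Because the pattern is anchored at the end of the string, entries that do not end in a distance and direction are left untouched. A minimal sketch of how one might check that on a small sample; the entry "Springfield" is a hypothetical value that is not in the original data, and the variable names are only for illustration:

```r
# Three entries from the question plus one hypothetical entry
# ("Springfield") that has no distance/direction suffix.
sample_cities <- c("Fairhope 3NE", "Selma 13 WNW",
                   "Stovepipe Wells 1 SW", "Springfield")

# space, digit(s), optional space, one or more compass letters, end of string
suffix <- " \\d+ ?[ENSW]+$"

# Flag the entries that actually end with the expected suffix.
has_suffix <- grepl(suffix, sample_cities)

# Strip the suffix; entries without it pass through unchanged.
city_names <- sub(suffix, "", sample_cities)

print(has_suffix)
print(city_names)
```

Checking has_suffix first is a cheap way to spot rows that do not follow the assumed format before trusting the cleaned names.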
You can remove everything from the first digit onward, together with any whitespace in front of it:
> sub("\\s*\\d.*", "", cities)
[1] "Fairhope" "Gadsden" "Selma" "Batesville" "Elgin" "Tucson" "Williams" "Fallbrook" "Stovepipe Wells"
[10] "Cortez" "La Junta" "Montrose" "Everglades City" "Sebring" "Brunswick" "Newton" "Newton" "Watkinsville"
[19] "Des Moines" "Champaign" "Shabbona" "Bedford" "Manhattan" "Oakley" "Bowling Green" "Versailles" "Lafayette"
[28] "Monroe" "Goodridge" "Chillicothe" "Joplin" "Salem" "Holly Springs" "Newton" "Asheville" "Asheville"
[37] "Durham" "Jamestown" "Medora" "Northgate" "Harrison" "Lincoln" "Lincoln" "Whitman" "Las Cruces"
[46] "Los Alamos" "Socorro" "Mercury" "Coshocton" "Goodwell" "Stillwater" "Stillwater" "Coos Bay" "Corvallis"
[55] "Riley" "Blackville" "McClellanville" "Aberdeen" "Buffalo" "Pierre" "Sioux Falls" "Crossville" "Austin"
[64] "Bronte" "Edinburg" "Monahans" "Muleshoe" "Palestine" "Panther Junction" "Necedah"
Here,
- \\s* matches zero or more whitespace characters
- \\d matches a digit
- .* matches the rest of the string.
sub performs only a single match-and-replace operation, which is enough here because the match runs to the end of the string.
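To illustrate the difference between sub and gsub mentioned above, here is a small sketch on a single string from the question. The variable names are made up for the example:

```r
x <- "Selma 13 WNW"

# sub() replaces only the first match, gsub() replaces every match.
first_only <- sub("\\d", "#", x)
all_digits <- gsub("\\d", "#", x)

# With the pattern from this answer the single match already extends
# to the end of the string, so both functions give the same result.
with_sub  <- sub("\\s*\\d.*", "", x)
with_gsub <- gsub("\\s*\\d.*", "", x)

print(c(first_only, all_digits, with_sub, with_gsub))
```

Since there is nothing left to match after the first replacement, sub is the more natural choice for this pattern.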
You could use
gsub("(\\D+)\\s+.*", "\\1", cities)
which gives
[1] "Fairhope" "Gadsden" "Selma" "Batesville" "Elgin"
[6] "Tucson" "Williams" "Fallbrook" "Stovepipe Wells" "Cortez"
[11] "La Junta" "Montrose" "Everglades City" "Sebring" "Brunswick"
[16] "Newton" "Newton" "Watkinsville" "Des Moines" "Champaign"
[21] "Shabbona" "Bedford" "Manhattan" "Oakley" "Bowling Green"
[26] "Versailles" "Lafayette" "Monroe" "Goodridge" "Chillicothe"
[31] "Joplin" "Salem" "Holly Springs" "Newton" "Asheville"
[36] "Asheville" "Durham" "Jamestown" "Medora" "Northgate"
[41] "Harrison" "Lincoln" "Lincoln" "Whitman" "Las Cruces"
[46] "Los Alamos" "Socorro" "Mercury" "Coshocton" "Goodwell"
[51] "Stillwater" "Stillwater" "Coos Bay" "Corvallis" "Riley"
[56] "Blackville" "McClellanville" "Aberdeen" "Buffalo" "Pierre"
[61] "Sioux Falls" "Crossville" "Austin" "Bronte" "Edinburg"
[66] "Monahans" "Muleshoe" "Palestine" "Panther Junction" "Necedah"
Broken down, the pattern says:
(\\D+) # not a digit, 1+ times
\\s+ # at least one whitespace
.* # rest of the string
The whole match is then replaced with the first captured group, \\1 in this case.
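One caveat worth knowing: the pattern above only requires whitespace after the non-digits, so a multi-word name with no number would be cut at its last space. A possible variant (my own assumption, not part of the answer above) also requires a digit after the whitespace. "New York" is a hypothetical entry used only to show the difference:

```r
# Two entries from the question plus a hypothetical one without a number.
x <- c("Stovepipe Wells 1 SW", "Fairhope 3NE", "New York")

# Pattern from the answer: non-digits, whitespace, rest of the string.
original <- gsub("(\\D+)\\s+.*", "\\1", x)

# Variant (assumption): additionally require a digit after the whitespace,
# so strings without a number do not match and are returned unchanged.
stricter <- sub("(\\D+)\\s+\\d.*", "\\1", x)

print(original)
print(stricter)
```

For the data in the question both versions return the same city names; the variant only matters if some rows lack the numeric part.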
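This regular expression should work in any regex flavor that supports lookaheads, including R (with perl = TRUE): .+?(?=\ \d)
It matches everything up to the first space that is followed by a digit. Note that it extracts the city name rather than removing the suffix, and that backslashes must be doubled inside an R string.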
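In R, a lookahead pattern like this needs the PCRE engine, and because it matches the name instead of deleting the suffix, it pairs naturally with regexpr and regmatches. A minimal sketch on three strings from the question; the variable names are only for illustration:

```r
x <- c("Stovepipe Wells 1 SW", "La Junta 17 WSW", "Fairhope 3NE")

# perl = TRUE switches to PCRE, which supports the (?= ...) lookahead.
# The lazy .+? stops at the first position followed by a space and a digit.
m <- regexpr(".+?(?= \\d)", x, perl = TRUE)

# regmatches() returns only the matched substrings; elements with no
# match are dropped, so the result can be shorter than the input.
city_names <- regmatches(x, m)

print(city_names)
```

If some rows might not match, it is worth comparing length(city_names) with length(x) before assigning the result back to a data frame column.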