Как отфильтровать журнал доступа apache, используя awk, sed или cut
Это мой файл журнала доступа apache. Я хочу, чтобы apache access log uniq count для URL.
"2011-09-07 17:00:00" "GET /abc/index.php/contentapi/discontent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/"
"2011-09-07 17:00:17" "GET /abc/index.php/contentapi/discontent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:21" "GET /abc/index.php/contentapi/discontent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:00" "GET /abc/index.php/data/dataContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:00" "GET /abc/index.php/Api/ApiContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:16" "GET /abc/index.php/Api/ApiContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:29" "GET /abc/index.php/Api/ApiContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:22" "GET /abc/index.php/htmlrequest/htmlContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:38" "GET /abc/index.php/htmlrequest/htmlContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:44" "GET /abc/index.php/htmlrequest/htmlContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:33" "GET /abc/index.php/Api/ApiContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:04" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:06" "GET /abc/index.php/data/dataContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:14" "GET /abc/index.php/data/dataContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http
"2011-09-07 17:00:51" "GET /abc/index.php/Api/ApiContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:33" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:45" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:59" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:02:00" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:02:09" "GET /abc/index.php/site/siteContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:00" "GET /abc/index.php/htmlrequest/htmlContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
"2011-09-07 17:00:09" "GET /abc/index.php/htmlrequest/htmlContent/4fd590d1762eb/ALL/allowed/1/all/all/1/http/
В приведенном выше файле я дал образец. файл журнала постоянно растет.
Ожидаемый результат
/abc/index.php/contentapi/discontent/ - 3
/abc/index.php/data/dataContent/ - 3
/abc/index.php/Api/ApiContent/ - 5
/abc/index.php/site/siteContent/ - 6
/abc/index.php/htmlrequest/htmlContent/ - 5
3 ответа
Я думаю, что в журнале apache могли быть некоторые опечатки, но как насчет этого:
$ grep -o 'abc/[^ 0-9]*/' apache.log | sort | uniq -c | sort -r
6 abc/index.php/site/siteContent/
5 abc/index.php/htmlrequest/htmlContent/
5 abc/index.php/Api/ApiContent/
3 abc/index.php/data/dataContent/
2 abc/index.php/contentapi/discontent/
1 abc/index.php/contentapi/
Это извлекает четвертое поле, которое считается URL
cat logfile | awk -F' ' '{print $4}' | awk -F'/' '{print $2"/"$3"/"$4"/"$5}' | sort | uniq -c
С GNU awk для gensub():
$ awk '{cnt[gensub(/(([/][^/]+){4}[/]).*/,"\\1","",$4)]++} END{for (url in cnt) print url " - " cnt[url]}' file
/abc/index.php/contentapi/discontent/ - 3
/abc/index.php/data/dataContent/ - 3
/abc/index.php/site/siteContent/ - 6
/abc/index.php/Api/ApiContent/ - 5
/abc/index.php/htmlrequest/htmlContent/ - 5