Does anyone recognize this MARC JSON format?
Does anyone recognize this format (see the paste at the bottom)? It comes from the Répertoire de vedettes-matière (RVM). It is neither of these two:
- https://metacpan.org/pod/Catmandu::Exporter::MARC::MiJ
- http://search.cpan.org/~cfouts/MARC-File-JSON-0.003/lib/MARC/File/JSON.pm
I can program in Perl. This is also posted as https://github.com/LibreCat/Catmandu-MARC/issues/88.
I can hack it up with just JSON::XS, but I don't know how to handle this strange accent encoding (a few sample strings shown, out of 325):
{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A
Here is the strange MARC JSON:
{
"rows" : [
{
"RecordNumber" : "1",
"Tag" : "LDR",
"Indicators" : "",
"Content" : "00533nz 2200205n 4500"
}
,
{
"RecordNumber" : "1",
"Tag" : "001",
"Indicators" : "\" \"",
"Content" : "201-0000001"
}
,
{
"RecordNumber" : "1",
"Tag" : "005",
"Indicators" : "\" \"",
"Content" : "20121025110000.0"
}
,
{
"RecordNumber" : "1",
"Tag" : "008",
"Indicators" : "\" \"",
"Content" : "790704\\nfanvnnbabn\\\\\\\\\\\\\\\\\\\\\\b\\ana\\\\\\\\\\\\"
}
,
{
"RecordNumber" : "1",
"Tag" : "016",
"Indicators" : "\\\\",
"Content" : "$a0509B3366"
}
,
{
"RecordNumber" : "1",
"Tag" : "035",
"Indicators" : "\\\\",
"Content" : "$a(ISM)8013850"
}
,
{
"RecordNumber" : "1",
"Tag" : "035",
"Indicators" : "9\\",
"Content" : "$a201-0000001"
}
,
{
"RecordNumber" : "1",
"Tag" : "040",
"Indicators" : "\\\\",
"Content" : "$aCaQQLa$bfre"
}
,
{
"RecordNumber" : "1",
"Tag" : "150",
"Indicators" : "\\\\",
"Content" : "$aAlg{grave}ebres de Von Neumann"
}
,
{
"RecordNumber" : "1",
"Tag" : "450",
"Indicators" : "\\\\",
"Content" : "$wnne$aVon Neumann, Alg{grave}ebres de"
}
,
{
"RecordNumber" : "1",
"Tag" : "450",
"Indicators" : "\\\\",
"Content" : "$aW*-alg{grave}ebres"
}
,
{
"RecordNumber" : "1",
"Tag" : "550",
"Indicators" : "\\\\",
"Content" : "$wg$aC*-alg{grave}ebres"
}
,
{
"RecordNumber" : "1",
"Tag" : "550",
"Indicators" : "\\\\",
"Content" : "$wg$aEspace de Hilbert"
}
,
{
"RecordNumber" : "1",
"Tag" : "697",
"Indicators" : "\\\\",
"Content" : "$amm."
}
,
{
"RecordNumber" : "1",
"Tag" : "750",
"Indicators" : "\\7",
"Content" : "$aVon Neumann, Alg{grave}ebres de$2ram"
}
,
{
"RecordNumber" : "1",
"Tag" : "750",
"Indicators" : "\\0",
"Content" : "$aVon Neumann algebras"
}
]
}
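For reference, a minimal sketch of how rows in this shape could be regrouped into records using only the core JSON::PP module. It assumes every row carries the four keys shown above. The function name rows_to_records and the rule that a Content value starting with "$" holds subfields of the form $aFoo$bBar are assumptions made for illustration, not documented properties of the format.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Group the flat "rows" array into records keyed by RecordNumber.
# rows_to_records is a hypothetical helper, not part of any MARC module.
sub rows_to_records {
    my ($json_text) = @_;
    my $rows = decode_json($json_text)->{rows};
    my %records;
    for my $row (@$rows) {
        my $field = { tag => $row->{Tag}, ind => $row->{Indicators} };
        if ($row->{Content} =~ /^\$/) {
            # Assumed layout: "$aFoo$bBar" means subfield a = Foo, b = Bar.
            $field->{subfields} = [
                map  { [ substr($_, 0, 1), substr($_, 1) ] }
                grep { length } split /\$/, $row->{Content}
            ];
        }
        else {
            # Leader and control fields carry plain data.
            $field->{data} = $row->{Content};
        }
        push @{ $records{ $row->{RecordNumber} } }, $field;
    }
    return \%records;
}

# Two rows copied from the paste above.
my $sample = <<'JSON';
{ "rows" : [
  { "RecordNumber" : "1", "Tag" : "001", "Indicators" : "\" \"", "Content" : "201-0000001" },
  { "RecordNumber" : "1", "Tag" : "040", "Indicators" : "\\\\", "Content" : "$aCaQQLa$bfre" }
] }
JSON

my $records = rows_to_records($sample);
for my $field (@{ $records->{1} }) {
    my $value = exists $field->{data}
        ? $field->{data}
        : join(' ', map { "$_->[0]=$_->[1]" } @{ $field->{subfields} });
    print "$field->{tag}: $value\n";
}
```

A subfield value that itself contains a "$" would be split incorrectly by this sketch, so it is only a starting point.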
ADDED: this is the accent encoding from MARCMaker. I used the following:
use MARC::File::MARCMaker; # https://metacpan.org/pod/MARC::File::MARCMaker
# for some reason can't be found by module name, so use:
# cpanm http://www.cpan.org/authors/id/E/EI/EIJABB/MARC-File-MARCMaker-0.05.tar.gz
my $marc_charset = MARC::File::MARCMaker::usmarc_default();
$content = MARC::File::MARCMaker::_maker2char ($content, $marc_charset);
But when I check it, for example against this test text https://github.com/gmcharlt/marc-perl/blob/e8e0ecc92946d6dcb3c2270706041a30eff0f68d/marc-marcmaker/t/marcmaker.t#L92, it merely translates the accents/ligatures into text. I tried opening the translated text in a browser: some entities are not interpreted, and none of the accents is applied to the following character. So I think I now need some kind of "XML to Unicode" module to finish the translation:
This a test of diacritics like the uppercase Polish L in
Ł´od´z, the uppercase Scandinavia O in &Ostrok;st, the
uppercase D with crossbar in Đuro, the uppercase Icelandic
thorn in Þann, the uppercase digraph AE in Ægir, the
uppercase digraph OE in Œuvres, the soft sign in
rech&softsign;, the middle dot in col·lecci´o, the musical
flat in F♭, the patent mark in Frizbee®, the plus or minus
sign in ±54%, the uppercase O-hook in B&Ohorn;, the
uppercase U-hook in X&Uhorn;A, the alif in
mas&mlrhring;alah, the ayn in &mllhring;arab, the lowercase
Polish l in Włocław, the lowercase Scandinavian o in
K&ostrok;benhavn, the lowercase d with crossbar in đavola,
the lowercase Icelandic thorn in þann, the lowercase digraph
ae in være, the lowercase digraph oe in cœur, the lowercase
hardsign in s&hardsign;ezd, the Turkish dotless i in masalı,
the British pound sign in £5.95, the lowercase eth in
verður, the lowercase o-hook (with pseudo question mark) in
S&hooka;&ohorn;, the lowercase u-hook in T&uhorn; D&uhorn;c,
the pseudo question mark in c&hooka;ui, the grave accent in
tr`es, the acute accent in d´esir´ee, the circumflex in
cˆote, the tilde in ma˜nana, the macron in T¯okyo, the breve
in russki˘i, the dot above in ˙zaba, the dieresis (umlaut)
in L¨owenbr¨au, the caron (hachek) in ˇcrny, the circle
above (angstrom) in ˚arbok, the ligature first and second
halves in d&llig;i&rlig;ad&llig;i&rlig;a, the high comma off
center in rozdel&rcommaa;ovac, the double acute in
id˝oszaki, the candrabindu (breve with dot above) in
Ali&candra;iev, the cedilla in ¸ca va comme ¸ca, the right
hook in viet˛a, the dot below in te&dotb;da, the double dot
below in &under;k&under;hu&dbldotb;tbah, the circle below in
Sa&dotb;msk&ringb;rta, the double underscore in
&dblunder;Ghulam, the left hook in Lech Wał&commab;esa, the
right cedilla (comma below) in khŗong, the upadhmaniya (half
circle below) in &breveb;humantuˇs, double tilde, first and
second halves in &ldbltil;n&rdbltil;galan, high comma
(centered) in g&commaa;eotermika.
1 Answer
This is an encoding problem. The leader says the data is encoded in MARC-8, whereas your JSON data should be encoded in UTF-8. _maker2char() uses usmarc_default(), which maps the mnemonic accent encoding to MARC-8 encoded characters. Use MARC::Charset to convert the data to UTF-8. This should work:
#!/usr/bin/env perl
use 5.014;
use utf8;
use strict;
use autodie;
use warnings;
use MARC::File::MARCMaker;
use MARC::Charset qw(marc8_to_utf8);
my $data = q{This is a test of diacritics like the uppercase Polish L in {Lstrok}{acute}od{acute}z
the uppercase Scandinavia O in {Ostrok}st
the uppercase D with crossbar in {Dstrok}uro
the uppercase Icelandic thorn in {THORN}ann
the uppercase digraph AE in {AElig}gir
the uppercase digraph OE in {OElig}uvres
the soft sign in rech{softsign}
the middle dot in col{middot}lecci{acute}o
the musical flat in F{flat}
the patent mark in Frizbee{reg}
the plus or minus sign in {plusmn}54%
the uppercase O-hook in B{Ohorn}
the uppercase U-hook in X{Uhorn}A
the alif in mas{mlrhring}alah
the ayn in {mllhring}arab
the lowercase Polish l in W{lstrok}oc{lstrok}aw
the lowercase Scandinavian o in K{ostrok}benhavn
the lowercase d with crossbar in {dstrok}avola
the lowercase Icelandic thorn in {thorn}ann
the lowercase digraph ae in v{aelig}re
the lowercase digraph oe in c{oelig}ur
the lowercase hardsign in s{hardsign}ezd
the Turkish dotless i in masal{inodot}
the British pound sign in {pound}5.95
the lowercase eth in ver{eth}ur
the lowercase o-hook (with pseudo question mark) in S{hooka}{ohorn}
the lowercase u-hook in T{uhorn} D{uhorn}c
the pseudo question mark in c{hooka}ui
the grave accent in tr{grave}es
the acute accent in d{acute}esir{acute}ee
the circumflex in c{circ}ote
the tilde in ma{tilde}nana
the macron in T{macr}okyo
the breve in russki{breve}i
the dot above in {dot}zaba
the dieresis (umlaut) in L{uml}owenbr{uml}au
the caron (hachek) in {caron}crny
the circle above (angstrom) in {ring}arbok
the ligature first and second halves in d{llig}i{rlig}ad{llig}i{rlig}a
the high comma off center in rozdel{rcommaa}ovac
the double acute in id{dblac}oszaki
the candrabindu (breve with dot above) in Ali{candra}iev
the cedilla in {cedil}ca va comme {cedil}ca
the right hook in viet{ogon}a
the dot below in te{dotb}da
the double dot below in {under}k{under}hu{dbldotb}tbah
the circle below in Sa{dotb}msk{ringb}rta
the double underscore in {dblunder}Ghulam
the left hook in Lech Wa{lstrok}{commab}esa
the right cedilla (comma below) in kh{rcedil}ong
the upadhmaniya (half circle below) in {breveb}humantu{caron}s
double tilde
first and second halves in {ldbltil}n{rdbltil}galan
high comma (centered) in g{commaa}eotermika.
Alg{grave}ebres de Von Neumann
{grave}e
{ring}Z
{ringb}h
{ringb}s
{rlig}a
{rlig}A
};
my $marc_charset = MARC::File::MARCMaker::usmarc_default();
my $marc8 = MARC::File::MARCMaker::_maker2char($data, $marc_charset);
# prepare STDOUT to emit UTF-8
binmode(STDOUT, ':encoding(UTF-8)');
# convert marc8 to utf8
my $utf8 = marc8_to_utf8($marc8);
say $utf8;
Output:
This is a test of diacritics like the uppercase Polish L in Łódź
the uppercase Scandinavia O in Øst
the uppercase D with crossbar in Đuro
the uppercase Icelandic thorn in Þann
the uppercase digraph AE in Ægir
the uppercase digraph OE in Œuvres
the soft sign in rechʹ
the middle dot in col·lecció
the musical flat in F♭
the patent mark in Frizbee®
the plus or minus sign in ±54%
the uppercase O-hook in BƠ
the uppercase U-hook in XƯA
the alif in masʼalah
the ayn in ʻarab
the lowercase Polish l in Włocław
the lowercase Scandinavian o in København
the lowercase d with crossbar in đavola
the lowercase Icelandic thorn in þann
the lowercase digraph ae in være
the lowercase digraph oe in cœur
the lowercase hardsign in sʺezd
the Turkish dotless i in masalı
the British pound sign in £5.95
the lowercase eth in verður
the lowercase o-hook (with pseudo question mark) in Sở
the lowercase u-hook in Tư Dưc
the pseudo question mark in củi
the grave accent in très
the acute accent in désirée
the circumflex in côte
the tilde in mañana
the macron in Tōkyo
the breve in russkiĭ
the dot above in żaba
the dieresis (umlaut) in Löwenbräu
the caron (hachek) in črny
the circle above (angstrom) in årbok
the ligature first and second halves in di͡adi͡a
the high comma off center in rozdelo̕vac
the double acute in időszaki
the candrabindu (breve with dot above) in Alii̐ev
the cedilla in ça va comme ça
the right hook in vietą
the dot below in teḍa
the double dot below in k̲h̲ut̤bah
the circle below in Saṃskr̥ta
the double underscore in G̳hulam
the left hook in Lech Wałe̦sa
the right cedilla (comma below) in kho̜ng
the upadhmaniya (half circle below) in ḫumantuš
double tilde
first and second halves in n͠galan
high comma (centered) in ge̓otermika.
Algèbres de Von Neumann
è
Z̊
h̥
s̥
a
A
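To illustrate why the accents end up on the following letter, here is a minimal sketch using only core modules. In MARC-8 a combining diacritic precedes its base character, while in Unicode it follows it, so the conversion has to swap the two. The table below is a hypothetical, deliberately tiny subset (four mnemonics from the question), and the helper mnemonic_to_unicode is invented for this example. For real data, the MARC::File::MARCMaker plus MARC::Charset route above covers the full character set.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFC);

# Hypothetical subset: a few MARCMaker mnemonics and the Unicode
# combining characters they are assumed to correspond to.
my %combining = (
    grave => "\x{0300}",   # combining grave accent
    acute => "\x{0301}",   # combining acute accent
    ring  => "\x{030A}",   # combining ring above
    ringb => "\x{0325}",   # combining ring below
);

# Hypothetical helper, not part of any MARC module.
sub mnemonic_to_unicode {
    my ($text) = @_;
    my $names = join '|', map { quotemeta } sort keys %combining;
    # MARC-8 order is diacritic then base; Unicode order is base then
    # diacritic, so move the combining character behind the next letter.
    $text =~ s/\{($names)\}(.)/$2$combining{$1}/g;
    # Compose to a single code point where one exists (e + U+0300 becomes è).
    return NFC($text);
}

binmode(STDOUT, ':encoding(UTF-8)');
print mnemonic_to_unicode('Alg{grave}ebres de Von Neumann'), "\n";
# prints: Algèbres de Von Neumann
```

Mnemonics missing from the table, such as {rlig}, are left untouched, and stacked diacritics on one letter are not handled.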