<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Awk on Peczenyj's Blog</title><link>http://pacman.blog.br/categories/awk/</link><description>Recent content in Awk on Peczenyj's Blog</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 29 Dec 2012 03:17:00 +0000</lastBuildDate><atom:link href="http://pacman.blog.br/categories/awk/atom.xml" rel="self" type="application/rss+xml"/><item><title>Spell Correct in GNU AWK</title><link>http://pacman.blog.br/blog/2012/12/29/spell-correct-in-gawk/</link><pubDate>Sat, 29 Dec 2012 03:17:00 +0000</pubDate><guid>http://pacman.blog.br/blog/2012/12/29/spell-correct-in-gawk/</guid><description>&lt;p>Based on &lt;a href="http://norvig.com/spell-correct.html">Peter Norvig Spell Correct&lt;/a>&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#93a1a1;background-color:#002b36;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-awk" data-lang="awk">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#586e75"># Usage: gawk -v word=some_word_to_verify -f spelling.awk [ big.txt [ big2.txt ... ]]&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#586e75"># Gawk version with 15 lines -- 04/13/2008&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#586e75"># Author: tiago (dot) peczenyj (at) gmail (dot) com&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#586e75"># about.me/peczenyj&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#586e75"># Based on : http://norvig.com/spell-correct.html&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#268bd2">function&lt;/span> edits(w,max,candidates,list, i,j){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">for&lt;/span>(i&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">0&lt;/span>;i&lt;span style="color:#719e07">&amp;lt;&lt;/span> max ;&lt;span style="color:#719e07">++&lt;/span>i) &lt;span style="color:#719e07">++&lt;/span>list[&lt;span style="color:#268bd2">substr&lt;/span>(w,&lt;span style="color:#2aa198">1&lt;/span>,i) &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">2&lt;/span>)] &lt;span style="color:#586e75"># deletes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">for&lt;/span>(i&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">0&lt;/span>;i&lt;span style="color:#719e07">&amp;lt;&lt;/span> max&lt;span style="color:#719e07">-&lt;/span>&lt;span style="color:#2aa198">1&lt;/span>;&lt;span style="color:#719e07">++&lt;/span>i) &lt;span style="color:#719e07">++&lt;/span>list[&lt;span style="color:#268bd2">substr&lt;/span>(w,&lt;span style="color:#2aa198">1&lt;/span>,i) &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">2&lt;/span>,&lt;span style="color:#2aa198">1&lt;/span>) &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">1&lt;/span>,&lt;span style="color:#2aa198">1&lt;/span>) &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">3&lt;/span>)] &lt;span style="color:#586e75"># transposes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">for&lt;/span>(i&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">0&lt;/span>;i&lt;span style="color:#719e07">&amp;lt;&lt;/span> max ;&lt;span style="color:#719e07">++&lt;/span>i) &lt;span style="color:#719e07">for&lt;/span>(j &lt;span style="color:#719e07">in&lt;/span> alpha) &lt;span style="color:#719e07">++&lt;/span>list[&lt;span style="color:#268bd2">substr&lt;/span>(w,&lt;span style="color:#2aa198">1&lt;/span>,i) alpha[j] &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">2&lt;/span>)] &lt;span style="color:#586e75"># replaces&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">for&lt;/span>(i&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">0&lt;/span>;i&lt;span style="color:#719e07">&amp;lt;=&lt;/span> max ;&lt;span style="color:#719e07">++&lt;/span>i) &lt;span style="color:#719e07">for&lt;/span>(j &lt;span style="color:#719e07">in&lt;/span> alpha) &lt;span style="color:#719e07">++&lt;/span>list[&lt;span style="color:#268bd2">substr&lt;/span>(w,&lt;span style="color:#2aa198">1&lt;/span>,i) alpha[j] &lt;span style="color:#268bd2">substr&lt;/span>(w,i&lt;span style="color:#719e07">+&lt;/span>&lt;span style="color:#2aa198">1&lt;/span>)] &lt;span style="color:#586e75"># inserts&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">for&lt;/span>(i &lt;span style="color:#719e07">in&lt;/span> list) &lt;span style="color:#719e07">if&lt;/span>(i &lt;span style="color:#719e07">in&lt;/span> NWORDS) candidates[i] &lt;span style="color:#719e07">=&lt;/span> NWORDS[i] } 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#268bd2">function&lt;/span> correct(word ,candidates,i,list,max,temp){
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> edits(word,&lt;span style="color:#268bd2">length&lt;/span>(word),candidates,list)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">if&lt;/span> (&lt;span style="color:#719e07">!&lt;/span>&lt;span style="color:#268bd2">asort&lt;/span>(candidates,temp)) &lt;span style="color:#719e07">for&lt;/span>(i &lt;span style="color:#719e07">in&lt;/span> list) edits(i,&lt;span style="color:#268bd2">length&lt;/span>(i),candidates)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">return&lt;/span> (max &lt;span style="color:#719e07">=&lt;/span> &lt;span style="color:#268bd2">asorti&lt;/span>(candidates)) ? candidates[max] : word }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b58900">BEGIN&lt;/span>{ &lt;span style="color:#719e07">if&lt;/span> (&lt;span style="color:#b58900">ARGC&lt;/span> &lt;span style="color:#719e07">==&lt;/span> &lt;span style="color:#2aa198">1&lt;/span>) &lt;span style="color:#b58900">ARGV&lt;/span>[&lt;span style="color:#b58900">ARGC&lt;/span>&lt;span style="color:#719e07">++&lt;/span>] &lt;span style="color:#719e07">=&lt;/span> &lt;span style="color:#2aa198">&amp;#34;big.txt&amp;#34;&lt;/span> &lt;span style="color:#586e75"># http://norvig.com/big.txt&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#719e07">while&lt;/span>(&lt;span style="color:#719e07">++&lt;/span>i&lt;span style="color:#719e07">&amp;lt;=&lt;/span>&lt;span style="color:#268bd2">length&lt;/span>(x&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">&amp;#34;abcdefghijklmnopqrstuvwxyz&amp;#34;&lt;/span>)) alpha[i]&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#268bd2">substr&lt;/span>(x,i,&lt;span style="color:#2aa198">1&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#b58900">IGNORECASE&lt;/span>&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#b58900">RS&lt;/span>&lt;span style="color:#719e07">=&lt;/span>&lt;span style="color:#2aa198">&amp;#34;[^&amp;#34;&lt;/span>x&lt;span style="color:#2aa198">&amp;#34;]+&amp;#34;&lt;/span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>{ &lt;span style="color:#719e07">++&lt;/span>NWORDS[&lt;span style="color:#268bd2">tolower&lt;/span>(&lt;span style="color:#719e07">$&lt;/span>&lt;span style="color:#2aa198">1&lt;/span>)] }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#b58900">END&lt;/span>{ &lt;span style="color:#268bd2">print&lt;/span> (word &lt;span style="color:#719e07">in&lt;/span> NWORDS) ? word : &lt;span style="color:#2aa198">&amp;#34;correct(&amp;#34;&lt;/span>word&lt;span style="color:#2aa198">&amp;#34;)=&amp;gt; &amp;#34;&lt;/span> correct(&lt;span style="color:#268bd2">tolower&lt;/span>(word)) }
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is my version of Norvig&amp;rsquo;s spell corrector in GNU awk.&lt;/p></description></item><item><title>Manipulating logs with AWK and SED</title><link>http://pacman.blog.br/blog/2008/04/26/manipulando-logs-com-awk-e-sed/</link><pubDate>Sat, 26 Apr 2008 14:41:00 -0300</pubDate><guid>http://pacman.blog.br/blog/2008/04/26/manipulando-logs-com-awk-e-sed/</guid><description>&lt;div class='post'>
The &lt;a href="http://br.groups.yahoo.com/group/shell-script/">shell script&lt;/a> mailing list has come up with a nice challenge.&lt;br />&lt;br />&lt;cite>Folks, I have the following log:&lt;br />&lt;br />AAAA-------------campo_1-------------campo_2-----campo_3----campo_4---------- &lt;br />teste_1 371508787 371547453 38666 testetesteteste&lt;br />&lt;br />BBBB-------------campo_1-------------campo_2-----campo_3----campo_4---------- &lt;br />teste_2 4625081503 4651313710 26232207 testetesteteste&lt;br />&lt;br />I am trying to use awk with the following command: &lt;br />awk '$1~"teste_" {print $5";"$4}' teste > teste_.csv&lt;br />&lt;br />It does fetch what I want:&lt;br />$5 $4&lt;br />testetesteteste 38666&lt;br />testetesteteste 26232207&lt;br />&lt;br />However, I would like the output separated like this:&lt;br />&lt;br />AAAA------------- &lt;br />testetesteteste 38666 &lt;br />BBBB------------- &lt;br />testetesteteste 26232207 &lt;br />&lt;br />Does anyone have a tip on how to do this?&lt;/cite>&lt;br />&lt;br />Ah... 
good old &lt;span style="font-weight:bold;">SED&lt;/span> can solve this&lt;br />&lt;br />&lt;code>$ sed -rn '/(^[^-]+-+).*/{s//\1/;h};&lt;br />/^teste_/{s/.* ([^ ]+) +([^ ]+$)/\2 \1/;x;p;g;p}' arquivo.log&lt;br />AAAA-------------&lt;br />testetesteteste 38666&lt;br />BBBB-------------&lt;br />testetesteteste 26232207&lt;/code>&lt;br />&lt;br />OK, OK, that is rather complicated, but take a look at this:&lt;br />&lt;br />&lt;code>$ sed -rn '/^[^-]+-+/h;/^teste_/{x;p;g;p}' arquivo.log &lt;br />AAAA-------------campo_1-------------campo_2-----campo_3----campo_4----------&lt;br />teste_1 371508787 371547453 38666 testetesteteste&lt;br />BBBB-------------campo_1-------------campo_2-----campo_3----campo_4----------&lt;br />teste_2 4625081503 4651313710 26232207 testetesteteste&lt;/code>&lt;br />&lt;br />Let me explain&lt;br />1) the -n option tells sed "print only when I say so"&lt;br />2) the -r option enables extended regular expressions&lt;br />(so I do not have to escape the + quantifier, which means "one or&lt;br />more times", nor the parentheses that mark the groups).&lt;br />&lt;br />I pulled a trick here. The h command copies the pattern space into an area called the hold space, a kind of memory for sed, overwriting whatever was there. So the hold space always contains the most recent line that matches ^[^-]+-+ , which means: a line that starts with one or more characters other than -, followed by one or more - (as in&lt;br />AAAA------------- ... ).&lt;br />&lt;br />Now, when I find a line that starts with teste_ I:&lt;br />&lt;br />x) swap this line with the line held in memory (the current&lt;br />'teste_...' line goes in, the other one comes out).&lt;br />p) print the line that came out (AAAA---------- ...)&lt;br />g) fetch the line from memory (teste_...)&lt;br />p) print that line as well&lt;br />&lt;br />But that is still not what you asked for. 
This is where the trick comes in:&lt;br />&lt;br />&lt;span style="font-style:italic;">if a line does NOT contain what I want, I skillfully manipulate it&lt;br />until it does&lt;/span>&lt;br />&lt;br />I could have used several techniques, but... having started with sed, we may as well stay with it.&lt;br />&lt;br />&lt;code>$ sed -rn '/(^[^-]+-+).*/{s//\1/;h};&lt;br />/^teste_/{s/.* ([^ ]+) +([^ ]+$)/\2 \1/;x;p;g;p}' arquivo.log&lt;/code>&lt;br />&lt;br />I turned the first regex into (my_regex).* ; that is, I created a &lt;span style="font-style:italic;">group&lt;/span> for the part I care about. Then all it takes is:&lt;br />&lt;br />&lt;code>s/(my_regex).*/\1/&lt;/code>&lt;br />&lt;br />to reduce the whole line to what my regex matches. In other words, I deleted the rest of the line.&lt;br />&lt;br />In the second substitution I was sneakier: I have 2 groups and I replace the whole line with the groups, in reverse order. The kind of thing you do when you drink too much coffee and have no scruples.&lt;br />&lt;br />Shall we look at the &lt;span style="font-weight:bold;">AWK&lt;/span> version?&lt;br />&lt;br />&lt;code>$ awk '/^[^-]+-+/{match($0,/^[^-]+-+/); x=substr($0,1,RLENGTH)}&lt;br />/^teste_/{print x,"\n"$5,$4}' arquivo.log&lt;br />AAAA-------------&lt;br />testetesteteste 38666&lt;br />BBBB-------------&lt;br />testetesteteste 26232207&lt;/code>&lt;br />&lt;br />Here x stores that piece of the preceding header line, which I locate with match. match searches a string for a regular expression, in this case $0, and sets the variable RLENGTH to the length of the match; since the match starts at the first character, that is also where it ends. 
All that is left is to take that part of the string and store it in the variable x, which is read later.&lt;br />&lt;br />This page says a little about these two functions: &lt;a href="http://people.cs.uu.nl/piet/docs/nawk/nawk_92.html">http://people.cs.uu.nl/piet/docs/nawk/nawk_92.html&lt;/a>&lt;br />&lt;br />I could also have solved it this way&lt;br />&lt;code>$ awk '/^[^-]+-+/{sub(/-[^-]+.*$/,"-");x=$0} &lt;br />/^teste_/{print x,"\n"$5,$4}' arquivo.log&lt;br />AAAA-------------&lt;br />testetesteteste 38666&lt;br />BBBB-------------&lt;br />testetesteteste 26232207&lt;/code>&lt;br />&lt;br />Here, however, I crudely replace the rest of the AAAA------... line with a single -, abusing .* (and the fact that it is greedy). It looks simpler, but it is more fragile, although I cannot think of a case that would demonstrate a failure.&lt;br />&lt;br />AWK &amp; SED are fantastic tools for this kind of problem ;-)&lt;/div>
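&lt;p>As a small, self-contained sketch of what match() reports, the command below runs the same regular expression over a made-up sample line (the input string is an assumption for illustration, not taken from the log) and prints RSTART, RLENGTH and the matched text. RLENGTH is the length of the match; it coincides with the end position only because the match starts at column 1.&lt;/p>
&lt;pre>
```shell
printf '%s\n' 'AAAA---campo_1---campo_2' |
awk '{
    # match() returns the 1-based start of the leftmost match (0 if none)
    # and sets RSTART and RLENGTH as a side effect.
    if (match($0, /^[^-]+-+/))
        print RSTART, RLENGTH, substr($0, RSTART, RLENGTH)
}'
```
&lt;/pre>
&lt;p>With the sample line shown, this prints &lt;code>1 7 AAAA---&lt;/code>.&lt;/p>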
&lt;h2>Comments&lt;/h2>
&lt;div class='comments'>
&lt;div class='comment'>
&lt;div class='author'>blpsilva&lt;/div>
&lt;div class='content'>
Impressive, to say the least :)&lt;BR/>&lt;BR/>I think it is time to shake off my rust and reread the Advanced Bash Scripting Guide.&lt;BR/>&lt;BR/>You produce some quite nice pearls inside the shell ;)&lt;/div>
&lt;/div>
&lt;div class='comment'>
&lt;div class='author'>Tiago Peczenyj&lt;/div>
&lt;div class='content'>
grep + awk + sed:&lt;BR/>&lt;BR/>$ grep -B 1 teste_ arquivo.log | \&lt;BR/>awk '/teste_/{print $5,$4; next} 1' | \&lt;BR/>sed -r '/^--$/d;s/(^[^-]+-+)[^-].*/\1/'&lt;BR/>&lt;BR/>AAAA-------------&lt;BR/>testetesteteste 38666&lt;BR/>BBBB-------------&lt;BR/>testetesteteste 26232207&lt;/div>
&lt;/div>
&lt;/div></description></item><item><title>A spell checker in gawk</title><link>http://pacman.blog.br/blog/2008/04/13/um-corretor-ortogrfico-em-gawk/</link><pubDate>Sun, 13 Apr 2008 14:39:00 -0300</pubDate><guid>http://pacman.blog.br/blog/2008/04/13/um-corretor-ortogrfico-em-gawk/</guid><description>&lt;div class='post'>
Last year I published &lt;a href="http://peczenyj.blogspot.com/2007/08/implementando-um-corretor-ortogrfico.html">a short note about a small spell checker written in Python&lt;/a>.&lt;br />&lt;br />In his &lt;a href="http://norvig.com/spell-correct.html">article&lt;/a>, Peter Norvig explains the statistical principle behind the algorithm. At the end he lists several implementations of it (in D, Java, Ruby and even Erlang).&lt;br />&lt;br />After a lot of research, I decided to write a gawk version. The first one had 30 lines and did not work very well; after fixing and testing I arrived at this final form with only 15 lines.&lt;br />&lt;br />By line I mean a complete awk &lt;span style="font-style:italic;">statement&lt;/span>. Note that none of these lines contains the statement separator &lt;span style="font-weight:bold;">;&lt;/span> (semicolon), except where I use the C-style for loop.&lt;br />&lt;br />&lt;pre>&lt;code># Usage: gawk -v word=something -f thisfile.awk [ big.txt [ big2.txt ... ]]&lt;br /># Gawk version with 15 lines -- 04/13/2008&lt;br /># Author: tiago (dot) peczenyj (at) gmail (dot) com &lt;br /># Based on : http://norvig.com/spell-correct.html&lt;br />function edits(w,max,candidates,list, i,j){&lt;br /> for(i=0;i&lt; max ;++i) ++list[substr(w,1,i) substr(w,i+2)] &lt;br /> for(i=0;i&lt; max-1;++i) ++list[substr(w,1,i) substr(w,i+2,1) substr(w,i+1,1) substr(w,i+3)] &lt;br /> for(i=0;i&lt; max ;++i) for(j in alpha) ++list[substr(w,1,i) alpha[j] substr(w,i+2)] &lt;br /> for(i=0;i&lt;= max ;++i) for(j in alpha) ++list[substr(w,1,i) alpha[j] substr(w,i+1)] &lt;br /> for(i in list) if(i in NWORDS) candidates[i] = NWORDS[i] }&lt;br />&lt;br />function correct(word ,candidates,i,list,max,temp){&lt;br /> edits(word,length(word),candidates,list)&lt;br /> if (!asort(candidates,temp)) for(i in list) edits(i,length(i),candidates)&lt;br /> return (max = asorti(candidates)) ? candidates[max] : word }&lt;br />&lt;br />BEGIN{ if (ARGC == 1) ARGV[ARGC++] = "big.txt" # http://norvig.com/big.txt&lt;br /> while(++i&lt;=length(x="abcdefghijklmnopqrstuvwxyz")) alpha[i]=substr(x,i,1)&lt;br /> IGNORECASE=RS="[^"x"]+" }&lt;br />&lt;br />{ ++NWORDS[tolower($1)] }&lt;br />&lt;br />END{ print (word in NWORDS) ? word : "correct("word")=> " correct(tolower(word)) }&lt;/code>&lt;/pre>&lt;br />&lt;br />Here is the script in action:&lt;br />&lt;pre>$ time gawk -v word=somethink -f spelling.awk&lt;br />correct(somethink)=> something&lt;br />&lt;br />real 0m4.862s&lt;br />user 0m4.702s&lt;br />sys 0m0.093s&lt;/pre>&lt;/div>
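&lt;p>The edit generation in edits() can be explored on its own. The following is a minimal sketch, not the script itself: it assumes a hypothetical input word, gawk, and lists only the deletions and transpositions. It splits the word with substr(w, 1, i), on the grounds that awk strings are indexed from 1, so the head is exactly the first i characters.&lt;/p>
&lt;pre>
```shell
awk -v w=gawk 'BEGIN {
    n = length(w)
    # Deletions: keep the first i characters, skip character i+1.
    for (i = 0; i != n; ++i)
        print "delete:", substr(w, 1, i) substr(w, i + 2)
    # Transpositions: swap characters i+1 and i+2.
    for (i = 0; i != n - 1; ++i)
        print "transpose:", substr(w, 1, i) substr(w, i + 2, 1) substr(w, i + 1, 1) substr(w, i + 3)
}'
```
&lt;/pre>
&lt;p>For a four-letter word this prints four deletions (awk, gwk, gak, gaw) followed by three transpositions (agwk, gwak, gakw). Replacements and insertions follow the same head and tail split, with one letter of the alphabet placed in between.&lt;/p>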
&lt;h2>Comments&lt;/h2>
&lt;div class='comments'>
&lt;div class='comment'>
&lt;div class='author'>Rael&lt;/div>
&lt;div class='content'>
Tiago, once again, congratulations!&lt;BR/>It is great fun to play with these things, isn't it?&lt;BR/>Ah, I have not forgotten to send you the optimized Java version... I just could not find your email address to send it to! :P&lt;BR/>Send me an email and I will reply!&lt;/div>
&lt;/div>
&lt;div class='comment'>
&lt;div class='author'>Tiago Albineli Motta&lt;/div>
&lt;div class='content'>
A spell checker, a tattoo... that was a fun weekend for you, huh! hahaha&lt;/div>
&lt;/div>
&lt;/div></description></item></channel></rss>