shell pipe "loses" parts of the data, was: help needed to debug Perl script


M. Fioretti-2
Hello all, again

(sorry, this is a bit long, but it can't be helped)

in another thread today, I wrote how a Perl script I had been using
for weeks without problems had suddenly started to drive me mad, losing
~80% of the data it was supposed to print.

Basically the script was telling me "I have built a hash with ~26K keys,
and now I am going to print them all, one per line", but instead of ~26k
lines of data, it would only print ~4700. For details, please see my
other thread.

After some painful manual browsing of the input and output data, I have
realized that those missing lines were lost AFTER the script, in a shell
pipe. That Perl script (myscript.pl) is wrapped into a bash one that
filters this output in this way:

myscript.pl 2> error.log | tee datadump | grep ^csvfinal | sort | cut -c10- > result.csv

What I realized only ten minutes ago is that, after weeks of working
without problems, the **PIPE** stopped working. Namely:

a) datadump was simply truncated. The script would print 30k+ lines,
    and datadump would contain only the first ~6k or so.

b) on top of that, grep did not extract all the lines it was supposed to.
    Here is what I mean:

#> grep -c ^csvfinal datadump
5700 (this is because datadump itself was truncated, see above)

#> grep ^csvfinal datadump | wc -l
4700 (grep extracts FEWER lines than it can count)

If, instead, I run these commands manually, one at a time (note the -a
option!!!):

myscript.pl > manualdatadump
grep  -a ^csvfinal manualdatadump  | sort | cut -c10- > result.csv

then result.csv contains all the lines it was supposed to contain.
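
For anyone who wants to reproduce the difference between counting and printing, here is a minimal sketch. It assumes GNU grep, and the sample file is made up: a NUL byte stands in for whatever binary data was in the real dump. By default grep may treat such input as binary and suppress matching lines, while -c still counts them and -a prints them all. How many lines plain grep prints depends on the grep version, so the sketch only reports that number.

```shell
# Hypothetical sample: three matching lines plus one line carrying a NUL byte.
tmpdir=$(mktemp -d)
sample="$tmpdir/datadump"
printf 'csvfinal;one\nnoise\000\ncsvfinal;two\ncsvfinal;three\n' > "$sample"

# -c only counts matching lines, so binary data does not change the number.
counted=$(grep -c '^csvfinal' "$sample")

# Plain grep may suppress the lines and report a binary match instead.
printed=$(grep '^csvfinal' "$sample" 2>/dev/null | wc -l | tr -d ' ')

# -a forces grep to treat the input as text and print every matching line.
printed_text=$(grep -a '^csvfinal' "$sample" | wc -l | tr -d ' ')

echo "grep -c:         $counted"
echo "grep | wc -l:    $printed"
echo "grep -a | wc -l: $printed_text"
```

If the middle number is lower than the other two, the input is being handled as binary, and -a (or cleaning the data first) is the relevant fix.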

COMMENTS/QUESTIONS: in hindsight, I wasted time by not looking at the
pipe first simply because the script is much more complex, so I assumed
the fault could only be there. My fault, sorry :-(

BUT: what contributed to my confusion is the fact that everything, pipe
included, had been working for weeks without a hitch. Right now, my
explanation of what happened is that, by pure chance, yesterday:

a) the volume of input data processed and output by the script passed,
    for the first time, some threshold that makes the buffers used by
    shell pipes overflow

b) AND the data also contained, for the first time, non-ascii characters
    that make grep fail unless the -a option is used

I am not sure at all of what I have just written, and every comment and
tip to make sure this does not happen again in some future script is
very welcome.
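
A quick way to test hypothesis (a) is to push a comparable volume of plain text through the same pipeline shape. The sketch below is only a stand-in: seq and sed replace myscript.pl, and the temporary directory is hypothetical. A pipe's buffer makes the writer wait when it is full rather than discarding data, so on plain text both counts should come out at 30000.

```shell
# Hypothetical stand-in for myscript.pl: 30000 plain-text lines, all matching.
tmpdir=$(mktemp -d)

seq 1 30000 | sed 's/^/csvfinal;/' \
  | tee "$tmpdir/datadump" \
  | grep '^csvfinal' | sort | cut -c10- > "$tmpdir/result.csv"

# Count what reached the tee file and what reached the end of the pipeline.
dump_lines=$(wc -l < "$tmpdir/datadump" | tr -d ' ')
result_lines=$(wc -l < "$tmpdir/result.csv" | tr -d ' ')

echo "datadump:   $dump_lines lines"
echo "result.csv: $result_lines lines"
```

If both numbers match on your machine, volume alone is unlikely to be the cause. Something to check instead (not confirmed anywhere in this thread): when a later stage of a pipeline stops reading early, the stages before it receive SIGPIPE and can stop writing, which would leave the tee file short.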

Marco






-------- Original Message --------
Subject: SOLVED (not completely...): OT: help needed to debug Perl script
Date: 2018-10-17 18:14
 From: "M. Fioretti" <[hidden email]>
To: [hidden email]
Reply-To: [hidden email]

On 2018-10-17 12:06, Colin Law wrote:

> On Wed, 17 Oct 2018 at 06:07, M. Fioretti <[hidden email]>
> wrote:
>> ...
>>     157     print "\nADDINGURX: $url;\n";
>>     158     print "\nADDINGURQ: $qq;\n";
>> ...
>>     ~4700 lines starting with ADDINGURX
>>     ZERO lines starting with ADDINGURQ
>
> Do you mean that line 157 is printing ok but the output from line 158
> never appears?
> Are you sure there is not another line there somewhere printing
> ADDINGURX?

Answering (indirectly) also to Joel:


the snippet of script that I posted is part of the actual output of

#> cat -n myscript

So this code, from my original message:

    147 my $keycounter = 1;
    148
    149 foreach my $qtq (sort keys %all) {
    150
    151    printf "\nALLCHECK: %6.6s >> %s;\n", $keycounter, $qtq;
    152    $keycounter++;
    153 }
    154
    155 foreach my $qq (sort keys %all) {
    156    $url = $qq;
    157    print "\nADDINGURX: $url;\n";
    158    print "\nADDINGURQ: $qq;\n";

is lines 147 to 158 of the complete script, and consequently yes, I was
sure that there was no other Perl code at all playing tricks here.

What I have been trying to say, maybe badly, is:

a) the above is part of the actual code
b) I run the script dumping the output to a file, for further processing:

    #> myscript > datadump

c) and I get different numbers of lines from the three statements
    (again, what follows is ACTUAL output of grep at the shell prompt):

#> grep -c ^ALLCHECK datadump    (= line 151 prints 26080 keys from the hash)
26080
#> grep -c ^ADDINGURX datadump   (= line 157 prints only 4732 keys from the hash)
4732
#> grep -c ^ADDINGURQ datadump   (= line 158 prints only 4732 keys from the hash)
4732

now the "solution":

After looking at the whole flow from scratch, I found out that the
problem seems to be 100% *outside* that specific Perl script, and
somehow even more confusing (for me at least). But that deserves a
different thread, coming in a few minutes.

Thanks!!!

Marco

--
http://mfioretti.com

--
ubuntu-users mailing list
[hidden email]
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/ubuntu-users

Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

Ralf Mardorf via ubuntu-users
On Wed, 17 Oct 2018 18:36:18 +0200, M. Fioretti wrote:
>b) AND the data also contained, for the first time, non-ascii
>characters that make grep fail unless the -a option is used
>
>I am not sure at all of what I have just written, and every comment,
>and tip to make sure this does not happen again in some future script
>is very welcome.

Hi,

even 'grep -a' could return undesired output. A way to work around this
issue might be using the 'strings' command and then piping through
'grep'.

[rocketmouse@archlinux ~]$ grep -a SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
_SOURCE_REALTIME_TIMESTAMP=1539795805565362��R��q�ox!�b���_��I���t'��޿�v�|����8�&�������8���G��H�8�Qk��|���?
[rocketmouse@archlinux ~]$ strings /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | grep SOURCE | tail -1
_SOURCE_REALTIME_TIMESTAMP=1539795805565362
[rocketmouse@archlinux ~]$

Regards,
Ralf



Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

Ken D'Ambrosio
One thing that's bitten me when reading from files (but not sure about
pipes) is failing to close() the file before assuming I'm done reading.  
Bytes were left behind on the table, so to speak.

$.02,

-Ken


On 2018-10-17 13:05, Ralf Mardorf via ubuntu-users wrote:

> On Wed, 17 Oct 2018 18:36:18 +0200, M. Fioretti wrote:
>> b) AND the data also contained, for the first time, non-ascii
>> characters that make grep fail unless the -a option is used
>>
>> I am not sure at all of what I have just written, and every comment,
>> and tip to make sure this does not happen again in some future script
>> is very welcome.
>
> Hi,
>
> even 'grep -a' could return undesired output. A way to workaround this
> issue might be using the 'strings' command and than piping through
> 'grep'.
>
> [rocketmouse@archlinux ~]$ grep -a SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
> _SOURCE_REALTIME_TIMESTAMP=1539795805565362��R��q�ox!�b���_��I���t'��޿�v�|����8�&�������8���G��H�8�Qk��|���?
> [rocketmouse@archlinux ~]$ strings /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | grep SOURCE | tail -1
> _SOURCE_REALTIME_TIMESTAMP=1539795805565362
> [rocketmouse@archlinux ~]$
>
> Regards,
> Ralf


Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

Ralf Mardorf via ubuntu-users
In reply to this post by Ralf Mardorf via ubuntu-users
On Wed, 17 Oct 2018 19:05:23 +0200, Ralf Mardorf wrote:

>On Wed, 17 Oct 2018 18:36:18 +0200, M. Fioretti wrote:
>>b) AND the data also contained, for the first time, non-ascii
>>characters that make grep fail unless the -a option is used
>>
>>I am not sure at all of what I have just written, and every comment,
>>and tip to make sure this does not happen again in some future script
>>is very welcome.  
>
>even 'grep -a' could return undesired output. A way to workaround this
>issue might be using the 'strings' command and than piping through
>'grep'.
>
>[rocketmouse@archlinux ~]$ grep -a SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
>_SOURCE_REALTIME_TIMESTAMP=1539795805565362��R��q�ox!�b���_��I���t'��޿�v�|����8�&�������8���G��H�8�Qk��|���?
>[rocketmouse@archlinux ~]$ strings /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | grep SOURCE | tail -1
>_SOURCE_REALTIME_TIMESTAMP=1539795805565362
>[rocketmouse@archlinux ~]$

PS:

Sure, without the '-a' option 'grep' might not return any of the text
contained in the file [1]. On the other hand, 'strings' might filter out
output you want to get.

Since we don't know what you are actually trying to achieve, we can only
provide shots in the dark. Perhaps 'grep -a' always does the trick for
you, while 'strings' might fail for your needs.

[1]
[rocketmouse@archlinux ~]$ grep SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
Binary file /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal matches
[rocketmouse@archlinux ~]$



Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

Ralf Mardorf via ubuntu-users
On Wed, 17 Oct 2018 19:24:28 +0200, Ralf Mardorf wrote:

>On Wed, 17 Oct 2018 19:05:23 +0200, Ralf Mardorf wrote:
>>On Wed, 17 Oct 2018 18:36:18 +0200, M. Fioretti wrote:  
>>>b) AND the data also contained, for the first time, non-ascii
>>>characters that make grep fail unless the -a option is used
>>>
>>>I am not sure at all of what I have just written, and every comment,
>>>and tip to make sure this does not happen again in some future script
>>>is very welcome.    
>>
>>even 'grep -a' could return undesired output. A way to workaround this
>>issue might be using the 'strings' command and than piping through
>>'grep'.
>>
>>[rocketmouse@archlinux ~]$ grep -a SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
>>_SOURCE_REALTIME_TIMESTAMP=1539795805565362��R��q�ox!�b���_��I���t'��޿�v�|����8�&�������8���G��H�8�Qk��|���?
>>[rocketmouse@archlinux ~]$ strings /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | grep SOURCE | tail -1
>>_SOURCE_REALTIME_TIMESTAMP=1539795805565362
>>[rocketmouse@archlinux ~]$
>
>PS:
>
>Sure, without the '-a' option 'grep' might not return any text
>contained by the file [1]. Perhaps 'strings' does filter output you
                                                   ^^^^^^

"filter" in the sense of "miss", "not provide", "not return"
------------------------------------------------------------


>want to get.
>
>Since we don't know what you actually try to achieve, we only could
>provide shots in the dark. Perhaps 'grep -a' does always the trick for
>you, while 'strings' might fail regarding your needs.
>
>[1]
>[rocketmouse@archlinux ~]$ grep SOURCE /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal | tail -1
>Binary file /var/log/journal/a243f4e05c294b13ab6972bf4ff93907/system.journal matches
>[rocketmouse@archlinux ~]$



Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

M. Fioretti-2
In reply to this post by Ken D'Ambrosio
On 2018-10-17 19:20, Ken D'Ambrosio wrote:
> One thing that's bitten me when reading from files (but not sure about
> pipes) is failing to close() the file before assuming I'm done
> reading.  Bytes were left behind on the table, so to speak.

thanks Ken and Ralf for all your suggestions.  Indeed, even in my case
there were bytes left on the table. Odd that it never happened before.
However, I am rewriting the whole flow to avoid these issues, using both
intermediate/temporary files (to avoid the problem Ken describes above)
and the strings command.
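
A staged version of the flow could look roughly like the sketch below. Everything in it is illustrative: printf stands in for myscript.pl, the file names are made up, and grep -a is used for the text step instead of strings. The idea is that each stage writes its own file, and a line-count comparison at the end flags a short result instead of letting it pass silently.

```shell
tmpdir=$(mktemp -d)

# Stage 1: stand-in for "myscript.pl > datadump" (hypothetical data with a NUL byte).
printf 'csvfinal;b,2\nnoise\000\ncsvfinal;a,1\nALLCHECK: x\n' > "$tmpdir/datadump"

# Stage 2: extract the wanted lines, treating the dump as text.
grep -a '^csvfinal' "$tmpdir/datadump" > "$tmpdir/matched"

# Stage 3: sort and strip the 9-character "csvfinal;" prefix.
sort "$tmpdir/matched" | cut -c10- > "$tmpdir/result.csv"

# Sanity check: the result must have as many lines as the dump has matches.
expected=$(grep -a -c '^csvfinal' "$tmpdir/datadump")
actual=$(wc -l < "$tmpdir/result.csv" | tr -d ' ')
if [ "$expected" -eq "$actual" ]; then
  check=ok
else
  check=mismatch
  echo "line count mismatch: expected $expected, got $actual" >&2
fi
echo "check: $check"
```

Because every stage leaves a file behind, a mismatch can be traced to the stage that produced it by counting lines in each intermediate file.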

Will report if there are more problems, but at this point there
shouldn't be (fingers crossed!)

Thanks again,
Marco
--
http://mfioretti.com


Re: shell pipe "loses" parts of the data, was: help needed to debug Perl script

Paul Sladen-2
In reply to this post by M. Fioretti-2
On Wed, 17 Oct 2018, M. Fioretti wrote:
> myscript.pl 2> error.log | tee datadump | grep ^csvfinal | sort | cut -c10- >  result.csv
> a) datadump was simply truncated.

  grep --line-buffered

> b) on top of that, grep did not extract all the lines

  grep --text  or  grep --null-data

> a) the volume of input data .. passed .. some threshold

4kB/8kB ?

> b) AND the data also contained, for the first time, non-ascii characters

Use UTF-8, and only UTF-8.

> every comment, and tip

Please provide a .zip/tarball with the exact actual command being run,
the exact input/intermediate file, and the exact minimal
source code example that exhibits the issue.

   -Paul

