input csv unicode #33615 29 Oct 20 07:30 PM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Hi,
I don't know if this is an A-Shell problem, I'm just posting here in case we have any workaround that can speed up the solution.

In short, I'm importing CSV files generated by KOFAX (maybe some of you know this scanning package), and it produces them in Unicode.
If I read them using "input csv" I get nothing; I must open them in Notepad and save them as ANSI or UTF-8 and, from there, everything runs smoothly.
I already asked the Kofax guys whether they can save the files in one of those two formats but two days have passed without a reply and, from the previous steps of this project, I can see how limited they are.

It's not a nightmare to open and save those files, but having to start what's supposed to be an automatic process with this kind of step is not very elegant.

So, any magic on our side to solve this?

thanks in advance


Jorge Tavares

UmZero - SoftwareHouse
Brasil/Portugal
Re: input csv unicode [Re: Jorge Tavares - UmZero] #33616 30 Oct 20 01:22 AM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
The name Kofax brings back memories - way back in the 90's Ty and I were involved in an imaging product called Co-Star which Alpha Micro briefly sold, and Kofax was the preferred high-end scanning solution. But that's not much help to us now.

It seems like we should have already solved this problem, but I guess not. XTREE is now entirely in Unicode (thanks to a request by you relating to some Russian language translations), with translations back and forth between Unicode, UTF-8 and Ansi, so it doesn't seem like too much of a stretch. The question is exactly how to implement it.

From the application perspective, the ideal would be if it were transparent, i.e. the conversion was automatic within the INPUT statement. But I'm afraid that will be messy to implement, as INPUT is extremely complex, and the conversion from Unicode to ANSI/Latin1 (which we normally use) is not guaranteed. (UTF-8 would be, but existing applications are going to choke when they run into UTF-8 multi-byte characters.) So I'm thinking that some kind of standalone utility, perhaps an XCALL, to convert the file, or maybe one line of the file at a time, would be more straightforward, with less risk of ripple effects. You still have the problem of what to do with Unicode characters that require multiple bytes to translate into UTF-8, or which can't be handled in ANSI/Latin1 at all. Converting to UTF-8 and not worrying about that would be the simplest. But probably more practical would be to convert to ANSI/Latin1 and replace untranslatable characters with something like "?".

I'm kind of backed up right now but could imagine getting something done in the next few days, once we settle on just what it should be.
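
Editor's illustration: the two options weighed above (convert to UTF-8 and accept multi-byte characters, or convert to ANSI/Latin1 and substitute "?" for anything untranslatable) can be sketched in a few lines of Python. This is not A-Shell code and not the routine discussed in this thread; the function name convert_utf16 is hypothetical, and Latin-1 is only assumed as the ANSI target.

```python
def convert_utf16(data: bytes, target: str = "latin-1") -> bytes:
    """Decode UTF-16 bytes (the BOM selects the byte order) and re-encode.

    Characters that do not exist in `target` are replaced with "?",
    which is what Python's errors="replace" does when encoding.
    """
    text = data.decode("utf-16")
    return text.encode(target, errors="replace")


# Hypothetical sample: one Latin-1 character, one that is not in Latin-1.
sample = "São Paulo, €5".encode("utf-16")

print(convert_utf16(sample, "latin-1"))  # the euro sign becomes "?"
print(convert_utf16(sample, "utf-8"))    # nothing is lost, but ã and € are multi-byte
```

The trade-off is visible in the two results: the Latin-1 output stays one byte per character but loses information, while the UTF-8 output is lossless but is no longer one byte per character.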

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33617 30 Oct 20 09:17 AM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Hi Jack,
I remember Co-Star, and I took part in one, and only one, implementation of it in Portugal in the early 90's.
And, yes, I believe we are talking about the same Kofax which, as far as data handling goes, looks like it is suspended in time.

As I mentioned, I believe they already have the option to save it in ANSI, but three days after my request they must still be hunting for that option, the same one I saw in their dashboard at the beginning of the project when they sent a screenshot (I believe it was a 2 MB one, inside a Word document) :'(

So, for this particular case, an XCALL that just saves the file in ANSI would be more than enough; better still would be a GET_ENCODING to know whether the file needs conversion, and a SAVE_ENCODING.
But, considering that I already have a few settings for importing the Kofax files, adding one more to state the expected encoding would safely replace the need for GET_ENCODING, because they will not produce files in different formats.

Obviously, having this fully handled by INPUT and A-Shell globally would be the perfect option but, given the amount of work it would take and the hypothetical problems it could create down the road, I'm quite sure it isn't worth it.

Thanks for the reply and (usual) availability to take care of this.

PS: considering all the details I solved for them after designing the solution, with this one added, they will probably announce A-Shell as the best KOFAX integration platform ;)


Last edited by Jorge Tavares - UmZero; 30 Oct 20 09:18 AM.

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33618 31 Oct 20 06:39 PM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
Yes, I think a joint marketing campaign is the least they could do for us! We might even give them a discount if they bundle an A-Shell with every Kofax board.

But as for the GET_ENCODING, unfortunately, like so many other standards, there's some fuzziness around the outer edges. In this case, while the rules of the Unicode encoding are fairly rigid and precise, that's not true when it comes to how a file is supposed to indicate the encoding of the characters within. Ideally it would be part of the file metadata, but as there is no standard for that, it comes down to inserting some telltale magic bytes at the start of the file. Most Unicode files will have the so-called BOM (Byte Order Mark) at the start, but there's no particular enforcement of that rule.

I would suggest the creation of a Fn'GetEncoding(file$) function that uses XCALL GET to sample the first few bytes of the file to try to determine whether it is a valid BOM and what kind, or perhaps even to do some statistical sampling of the bytes to see if they "look like" Unicode. That would be more flexible than a routine embedded in A-Shell. Once you have confidence in the identification of the encoding, it becomes more practical to create an embedded XCALL to convert the file from one encoding to another.
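
Editor's illustration: the detection idea described above (look for a BOM first, then fall back to a statistical guess) might look roughly like this in Python. It is a sketch under assumptions, not the proposed Fn'GetEncoding: the name guess_encoding, the 512-byte sample size and the 60%/10% thresholds are all arbitrary choices for the example.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so it has to be tested before it.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_encoding(head: bytes) -> str:
    """Guess an encoding from the first bytes of a file.

    Returns a codec name, or "unknown" when there is no BOM and the
    bytes do not look like UTF-16.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name

    # Heuristic for BOM-less UTF-16: mostly-ASCII text has a NUL in
    # every other byte (odd offsets for LE, even offsets for BE).
    sample = head[:512]
    half = len(sample) // 2
    if half >= 2:
        even_nuls = sample[0::2].count(0)
        odd_nuls = sample[1::2].count(0)
        if odd_nuls > 0.6 * half and even_nuls < 0.1 * half:
            return "utf-16-le"
        if even_nuls > 0.6 * half and odd_nuls < 0.1 * half:
            return "utf-16-be"
    return "unknown"
```

A caller would read the first few hundred bytes of the file in binary mode and pass them in. The heuristic only helps for text that is mostly ASCII, which is why "unknown" remains a possible answer.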

In the meantime, I'm gearing up to head out to the polling station. First I have to pack my supplies (food, clothing, blankets, battery packs, masks, weapons, ...) :-O

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33619 01 Nov 20 08:46 AM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
...and your own pen to guarantee that the ink stays in the right squares

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33620 01 Nov 20 05:31 PM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
Well, I'm happy to say I survived voting in person. It wasn't actually that traumatic, which just goes to show how unevenly government services are distributed across the country. Even though it was the longest I've waited to vote in my entire life, it was still only about 30 minutes. And sadly, much of the delay was due to software problems in the check-in process! :'( The name lookup logic apparently could not handle a complicated last name like McGregor - they tried every combination they could think of (MCG, MC G, MC, MAC, etc.) but it wasn't until the onsite tech support guy figured out how to get into the "advanced search" and managed to find me by just looking at all the M's and then filtering by address that I finally got my blank ballot and was directed to the actual voting stations. (There were about 50 of them, most of them vacant, since all of the bottleneck was at the check-in.)

And for better or worse (almost certainly the latter, as far as the taxpayer is concerned), they've done away with the old-fashioned ink pen marking devices and gone to giant touch-screens, each station with its own printer and scanner. So you scan in your blank ballot, use the touch-screen to fill it out, it then gets printed, spit out, re-inserted to be scanned and then transmitted probably via Google, the NSA, the FSB, and who knows who else before hopefully arriving in the official electoral bit bucket with most of the bits intact. :-/ (Hopefully they haven't thrown away the old ink-blot devices, as we'll probably have to go back to them when these newfangled devices break down and become too expensive to maintain. Assuming of course they don't take the obvious step of re-implementing the entire thing in XTREE.)

Anyway, in the relative calm before the much-feared political storm to come, I've put together a function that converts Unicode (U16 Windows format) to either ANSI or UTF8. It could probably use some refinement, but the beauty of it is that it's in A-Shell BASIC, so it doesn't depend on updating A-Shell, and it's easy to modify. (It is limited to the Windows platform though, since it makes a call to KERNEL32.DLL via DYNLIB.SBR.) I posted a preliminary copy here: fnunicode.bsi

There's a test routine included in the bsi which you can activate by compiling as follows:

Code
.compil fnunicode.bsi/x:2/m/px/C:FNUNICODE_BSI_TEST_=1


To test, I loaded the ash65notes.txt into notepad and then saved it as ash65notes.u16 in Unicode format and then...
Code
.RUN FNUNICODE
Test Fn'U16'to'MCBS'File()
U16 file to convert from: SYS:ASH65NOTES.U16
BOM = FEFF Windows U16 - OK
File to convert to (blank to quit): test.txt
Codepage [0]:
Flags [0]:
ASCII value of default char (63):
Return value (bytes output) =  342874
Dfltused:  0
1) to view output file:


If all goes well (which it did for me) the result is identical to the original ash65notes.txt.

Note that the test program starts by calling a separate function to check the BOM (Byte Order Mark) of the file to make sure it's the expected FE FF. (If you try to convert a file that isn't in U16 format, the results won't be pretty.) Over time, I can imagine adding a function to go the other way, or maybe a more generalized one that will go from any format to any other format. But this should be enough to help you seal the worldwide joint-marketing deal with Kofax. :D
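
Editor's illustration: for readers without A-Shell at hand, the flow described above (check the BOM, convert the whole file, report the bytes written and whether the default character was needed) can be approximated in Python. This is not the code in fnunicode.bsi and does not call KERNEL32.DLL. The name u16_file_to_mbcs, its parameters and the choice of Windows-1252 as the "ANSI" target are assumptions made for the example. On disk, a Windows UTF-16 file begins with the bytes FF FE, which is the code point U+FEFF stored least significant byte first.

```python
import codecs
import tempfile
from pathlib import Path


def u16_file_to_mbcs(src, dst, encoding="cp1252", default_char="?"):
    """Convert a UTF-16 LE file (with BOM) to `encoding`.

    Returns (bytes_written, default_used), where default_used is True
    if any character had to be replaced by `default_char`.
    """
    raw = Path(src).read_bytes()
    if not raw.startswith(codecs.BOM_UTF16_LE):
        raise ValueError(f"{src}: no UTF-16 LE BOM (FF FE) found")
    text = raw[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")

    out = bytearray()
    default_used = False
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Character does not exist in the target encoding.
            out += default_char.encode("ascii")
            default_used = True
    Path(dst).write_bytes(bytes(out))
    return len(out), default_used


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "sample.u16"
        dst = Path(tmp) / "sample.txt"
        # Hypothetical sample data, written the way Notepad's "Unicode" does.
        src.write_bytes(codecs.BOM_UTF16_LE + "nome;preço\r\n".encode("utf-16-le"))
        print(u16_file_to_mbcs(src, dst))
```

Passing encoding="utf-8" gives the lossless variant, in which case default_used should always come back False.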

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33621 01 Nov 20 06:22 PM
Ty Griffin | Member | Joined: Mar 2005 | Posts: 494
I wonder about which version of the voting software is being used by Los Angeles County. The latest developer's version? The stable version? Not the latest developer's version but the one released a few months ago because they need the xyz feature? The one from four years ago because they just can't bear the idea of an update?

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33622 01 Nov 20 06:26 PM
Steve - Caliq | Member | Joined: Sep 2003 | Posts: 4,158
I’m surprised they haven't installed the latest multiple-checkbox XTREE version? Choose as many candidates as you want... or Frank’s drop and drag to reorder your preferences...

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33623 01 Nov 20 10:19 PM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Now I'm really scared about the result of the elections :'(
Are you saying that you start by scanning the blank form that you received from a guy who took half an hour to find you in the system?
Then you mark the crosses using a touch-screen.
Then you print the filled-in form.
And, finally, you have to scan that paper, whose digital interpretation is sent as your final and secret vote to be counted somewhere in the Universe!!!!!!

Do you have statistics on how many voters don't complete the process?
How many left the initial form, filled in by pen, in the scanner?
How many left their vote on the screen and walked away?
How many gave up at one of the steps of the process?
How many committed suicide in the process?

My God, only one thing comes to mind: the cherry on top of the cake would be if the final step (scan, read and send the result) were a Kofax system :'(

PS: Maybe you should send an urgent email to the vote-counting central office, asking if they need a function to convert the received results to something understandable.


By the way, thank you very much for the solution; after catching my breath from the news, I'll try it and give you feedback.
Many, many thanks.

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33624 01 Nov 20 10:46 PM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
And you wonder how our current President got elected????

Fortunately, I anticipated all of those concerns, which is why I skipped all of the checkboxes and then used my sweaty fingertip to handwrite "Jorge Tavares" in the write-in section for position of Election Czar. (I'm afraid they may need more than Kofax to read my touch-screen handwriting though.)

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33630 04 Nov 20 10:38 PM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Sorry for the interruption but, considering the high probability of a lot of vote recounting over there, you probably won't have much to announce for a while, so I'll take the chance to break the news that this conversion ran like a charm.
It's a little detail that made a huge difference for the user.
I much appreciate the solution and the way you did it; it took me only a couple of minutes to embed it in my program.

Good luck there

Re: input csv unicode [Re: Jorge Tavares - UmZero] #33631 05 Nov 20 02:58 PM
Frank | Member | Joined: Sep 2002 | Posts: 5,471
Jorge - I think a lot of people are in their bunkers... it may be a while before it's safe to venture out! :-O

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34755 22 Nov 21 08:51 AM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Hi Jack,

It seems that changes in DYNLIB (6.5.1708.0) broke the UTF conversion.
Let me know if this is enough or if you need any special debug or file to test.

I've reverted to the previous version, no hurry.
Thanks

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34756 22 Nov 21 04:14 PM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
Hi Jorge,

I'm a little confused here -- we are talking about DYNLIB.SBR and not INPUT CSV, right? A/a parameter types?

There was definitely some recent tinkering in DYNLIB related to 64 bit support (parameter types l, L, h), but after some review, I don't see how it impacted use of the A/a parameter types (for ANSI to UTF8 conversions).

I don't have a handy test DLL that either consumes or returns UTF8 parameters, so I'm not quite sure how to test. (If you have one that can be used in a standalone way, please share it!) But I can verify that routines that use standard ANSI parameters (type Z/z) work fine when the parameter type is changed to A/a (which only proves that the ANSI-UTF8 conversion doesn't break ANSI strings that don't contain characters requiring multi-byte conversion).
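
Editor's illustration: the limitation noted above (a test with plain ANSI strings cannot exercise the multi-byte path) follows from the fact that ASCII-only text has identical bytes in ANSI and in UTF-8. A quick Python check, assuming "ANSI" means Windows-1252 and using a hypothetical helper name needs_multibyte:

```python
def needs_multibyte(text: str) -> bool:
    """True if at least one character takes more than one byte in UTF-8."""
    return len(text.encode("utf-8")) != len(text)


plain = "INVOICE,1234,OK"   # ASCII only
accented = "conversão"      # contains ã (E3 in cp1252, C3 A3 in UTF-8)

# ASCII-only text is byte-for-byte identical in both encodings, so a
# conversion between them cannot visibly change it.
assert plain.encode("cp1252") == plain.encode("utf-8")
assert not needs_multibyte(plain)

# A single accented character is enough to make the encodings differ.
assert accented.encode("cp1252") != accented.encode("utf-8")
assert needs_multibyte(accented)
```

So a meaningful test of an ANSI/UTF-8 parameter conversion needs at least one character above 127 in the test string.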

BTW, did you run your half-marathon yesterday?

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34757 22 Nov 21 04:59 PM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Hi Jack,

I've attached a CSV needing conversion, and below is my code calling your functions (funicode.bpi), which take care of everything.
I'm getting "A conversão não foi bem sucedida." ("The conversion was not successful.") after the call to Fn'U16'to'MCBS'File()

Apologies if I'm giving you just pieces instead of a ready-made example (you're much better than me at that), but let me know if you need anything more.

Code

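      ! Check the BOM to find out how the file is encoded
      encoding = Fn'U16'BOM(csv'file$)
      if encoding=BOM_U16_LSB then
         ! Message: "File is in Windows Unicode16. It will be converted to UTF8."
         xcall sbxmsg, MSG'EXIT, "Ficheiro em Windows Unicode16."+CRLF$+"Vai ser convertido para UTF8.", "Conversão de encoding", OK_CANCEL, INFORMACAO
         if MSG'EXIT=0 then
            ! Convert into a .txt file next to the original; a return value of 0 is treated as failure
            if Fn'U16'to'MCBS'File(csv'file$, csv'file$[1, -4]+"txt", codepage=0, flags=0, dfltchar=63)=0 then
               ! Message: "The conversion was not successful."
               xcall sbxmsg, 0, "A conversão não foi bem sucedida.", "Conversão de encoding", 0, CRUZ
               EXITFUNCTION
            else
               csv'file$ = csv'file$[1, -4]+"txt"
            endif
         else
            EXITFUNCTION
         endif
      elseif encoding=BOM_U16_MSB then
         ! Message: "File is in MSB Unicode16. Conversion not supported. Open the file in notepad and save it as UTF-8."
         xcall sbxmsg, 0, "Ficheiro em MSB Unicode16."+CRLF$+"Conversão não suportada."+CRLF$+"Abra o ficheiro no notepad e grave em UTF-8.", "Conversão de encoding", 0, CRUZ
         EXITFUNCTION
      endif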
   




As for the half-marathon: yes, but not yesterday. It was the previous Sunday, the 14th, and it was great. The weather conditions were perfect, sunny but not hot, between 17 and 20 °C from the start (at 5:50 AM) to the finish line at 7:45. I reached my goal of breaking both my record and the 2:00-hour barrier, with an official time of 1:55:01, which corresponds to 1268th place in the men's ranking out of a total of 3394 participants, and 33rd place out of 130 in my age group (55-59).
I'm happy because that closed the season of abstinence from drinks. :D
No matter the results, the course is very beautiful; I've attached some pictures I bought from the official reporters.

Attached Files
Batch041.zip (1.44 KB)
SHA1: c9af8d561f0718b51d096d114fd970d02c7ab756
6743663.jpg, 6743664.jpg, 6743665.jpg, 6743666.jpg

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34758 22 Nov 21 06:53 PM
Frank | Member | Joined: Sep 2002 | Posts: 5,471
Congrats Jorge, well done! You are very focused! You give hope to the old men of the world while at the same time giving me some reason to get off my tail... though I am more of a walker/hiker; my running days are far behind me! Keep up the good work! :D

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34759 22 Nov 21 07:31 PM
Jack McGregor | Member | Joined: Jun 2001 | Posts: 11,794
Wow - awesome accomplishment! I'm sorry I couldn't have been there to cheer from the sidelines (and especially to help celebrate the end of abstinence!) :D

Kind of hard to get psyched up to dig into character set conversion after that, but thanks for reminding me of the function. (I had forgotten that it even used DYNLIB.) This week I'm juggling both jury duty and grandparent duty (kids now take the entire week of Thanksgiving off), so I'm not sure exactly when I'm going to get to it, but it shouldn't be too long...

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34760 22 Nov 21 08:48 PM
Jorge Tavares - UmZero (OP) | Member | Joined: Jun 2001 | Posts: 3,406
Thank you guys. I'm anxious to drink with you all in person and celebrate many victories together. Until then, enjoy a fantastic Thanksgiving week with lots of family at home to cheer, kiss, hug and do everything we love to do, which is to celebrate life with family and friends.
Jack, don't worry, take your time, enjoy the week with the kids, I'm not waiting for that fix.
Big hug

Last edited by Jorge Tavares - UmZero; 22 Nov 21 08:48 PM.

Re: input csv unicode [Re: Jorge Tavares - UmZero] #34761 23 Nov 21 03:26 PM
Frank | Member | Joined: Sep 2002 | Posts: 5,471
Well said Jorge! Thanks and best to you and your lovely family! :)


Moderated by  Jack McGregor, Ty Griffin 