Convert TRES CSV to true tab-delimited or TMX
Thread poster: Samuel Murray
Samuel Murray  Identity Verified
Netherlands
Local time: 22:09
Member (2006)
English to Afrikaans
+ ...
Jun 20, 2016

Hello everyone

I'd like to convert TRES CSV files to a TM format that I can use in my tools (notably Wordfast Classic), and that means that I should convert these CSV files to "true" tab-delimited files or (if that's not possible) to TMX. And I'd like to do this simply -- commandline would be best, since I have 170 of these files to process. I use Windows 7.

The TRES CSV format is as follows:

- UTF-16 LE
- All fields are tab-delimited
- No fields are quoted, except:
-- if the field itself contains quotes
-- if the field itself contains line breaks or tabs
-- if the field itself contains a comma
- For fields that contain quotes, the field itself is quoted, and quotes within the field are doubled up
- Empty fields are simply empty (i.e. a doubled tab, with no quotes)
- An empty field at the start of a record is still considered a field
- Records are not required to all have the same number of fields
- There is a header record (but I don't require that the header is treated as a header)

Essentially, for "true" tab-delimitedness, I'd like line breaks and tabs within fields to be replaced with e.g. @LF and @TAB, and I'd like quoted fields to be made unquoted and their doubled quotes made into single quotes.
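
A minimal sketch of that conversion in Python, assuming the files really do follow the quoting rules listed above. Python's csv module copes with tab delimiters, quoted fields and doubled quotes. The function names are made up for illustration, and @LF and @TAB are simply the placeholders suggested above.

```python
import csv
import io


def flatten_field(field: str) -> str:
    """Replace embedded line breaks and tabs with visible placeholders."""
    field = field.replace("\r\n", "@LF").replace("\r", "@LF").replace("\n", "@LF")
    return field.replace("\t", "@TAB")


def tres_to_plain_tsv(text: str) -> str:
    """Turn quoted tab-delimited text into unquoted, one-record-per-line text."""
    # newline="" keeps line breaks inside quoted fields exactly as they were.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter="\t",
        quotechar='"',
        doublequote=True,
    )
    lines = ["\t".join(flatten_field(f) for f in record) for record in reader]
    return "".join(line + "\r\n" for line in lines)
```

To use it on a real file, read the text with open(path, encoding="utf-16", newline=""), which relies on the file starting with a byte-order mark, and write the result back out in whatever encoding the TM tool expects.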

For creating a TMX file, it must be noted that in TRES CSV, field 1 is the source text, field 2 is the target text, and fields 3, 4 and 5 are the TU attributes. I would be happy if fields 3, 4 and 5 are simply concatenated with e.g. pipes between them.
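
A rough sketch of the TMX side, assuming the records have already been parsed into lists of fields. The language codes, the creationtool name and the prop type "x-tres-attributes" are placeholders chosen for illustration, not anything TRES or Wordfast prescribes. The header record is not treated specially, so drop it beforehand if it is unwanted.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def records_to_tmx(records, src_lang="en-US", tgt_lang="af-ZA"):
    """Build a TMX 1.4 document (as a string) from parsed records."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "tres2tmx-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "TRES CSV",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for record in records:
        if len(record) < 2:
            continue  # no source/target pair, nothing to store
        tu = ET.SubElement(body, "tu")
        # Fields 3, 4 and 5 are concatenated with pipes, as suggested above.
        attributes = "|".join(record[2:5])
        if attributes.strip("|"):
            ET.SubElement(tu, "prop", type="x-tres-attributes").text = attributes
        for lang, text in ((src_lang, record[0]), (tgt_lang, record[1])):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

ElementTree escapes &, < and > in the segment text on its own, which is one less thing to handle by hand.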

Does anyone have or know of a utility to do this with?

Thanks
Samuel


[Edited at 2016-06-20 21:20 GMT]


 
Dan Lucas  Identity Verified
United Kingdom
Local time: 21:09
Member (2014)
Japanese to English
Tools Jun 20, 2016

Samuel Murray wrote:
Essentially, for "true" tab-delimitedness, I'd like line breaks and tabs within fields to be replaced with e.g. @LF and @TAB, and I'd like quoted fields to be made unquoted and their doubled quotes made into single quotes.

Well, Perl was my first thought, but if it boils down to the above then wouldn't a regex-based search and replace tool do the job?

Something like (off the top of my head) s/(\t.*?)\n(.*?\t)/\1@LF\2/ to replace the line feeds, for example.

I use PowerGREP, which has a good support forum and a nice graphical front end for regex construction, preview and testing. It also has good undo facilities, and it will churn through thousands of files if you want it to.

What software is it that generates these TRES CSV files?

Dan


 
Samuel Murray  Identity Verified
Netherlands
Local time: 22:09
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
@Dan Jun 20, 2016

Dan Lucas wrote:
Something like (off the top of my head) s/(\t.*?)\n(.*?\t)/\1@LF\2/ to replace the line feeds, for example.


I'm not sure if the line breaks within fields would always be \n and that line breaks at the end of records would always be \r\n. For all I know, the line breaks within fields are also \r\n.

Won't your regex join all records into one super-long record? (-: My first thought was regex as well, but it's a little too complex for me.

Dan Lucas wrote:
What software is it that generates these TRES CSV files?


TRES CSV files are created by Microsoft's internal localisation system. I have no idea what program people normally use to open these files (apart from Microsoft's internal localisation tools), though I suspect Microsoft Excel.


 
Dan Lucas  Identity Verified
United Kingdom
Local time: 21:09
Member (2014)
Japanese to English
Do several passes Jun 20, 2016

Samuel Murray wrote:
Won't your regex join all records into one super-long record? (-:

Quite possibly - these haven't been tested, just offered as examples.

Samuel Murray wrote:
My first thought was regex as well, but it's a little too complex for me.

Yes, regexes are the classic "now you have two problems", but they still may be your best bet. Presumably there must be some regularity to these files. For example, the end of each record in these files is marked with something consistent, probably a \r\n combination if they are Windows files. So provided you don't zap these you should be okay.

If you're nervous about regexes, get the trial of RegexBuddy from the same developer and see how you like it. It's a lot cheaper than PowerGREP at only 30 euros, which surely a job of this size would make worthwhile. You could build the regexes using RegexBuddy and use a free grep to do the actual replacements.

Samuel Murray wrote:
I'm not sure if the line breaks within fields would always be \n and that line breaks at the end of records would always be \r\n. For all I know, the line breaks within fields are also \r\n.

Tricky. If we can't define the problem unambiguously then we can't automate the processing. But if there are only a small number of alternatives, then using the "|" might do the job of specifying alternatives. Observation of the files should help you work out what is really needed.
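
One way to do that observation without reading 170 files by eye is to count which kinds of line break actually turn up inside fields. The sketch below is untested against real TRES files and assumes the quoting rules described at the top of the thread; the function name is invented for illustration.

```python
import csv
import io
from collections import Counter


def embedded_break_styles(text: str) -> Counter:
    """Count line-break styles and tabs that occur inside fields."""
    counts = Counter()
    # newline="" stops Python from normalising \r\n to \n while reading.
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    for record in reader:
        for field in record:
            crlf = field.count("\r\n")
            counts["CRLF"] += crlf
            counts["LF"] += field.count("\n") - crlf   # bare \n only
            counts["CR"] += field.count("\r") - crlf   # bare \r only
            counts["TAB"] += field.count("\t")
    return counts
```

If every count comes back as zero for a file, that file can be handled line by line; otherwise the result shows which alternatives the regexes have to allow for.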

In the case of PowerGREP you can create a number of actions and chain them together in a sequence, which allows you to break the problem into pieces.

Line feeds first. You say that fields with tabs/LF are quoted. Note the vertical bar specifying either \n OR \r:
s/(\t".*?)(?:\r\n|\n|\r)(.+?"\t)/\1@LF\2/

Then tabs:
s/(\t".*?)\t(.*?"\t)/\1@TAB\2/

Then quotes around fields i.e. those adjacent to the field delimiters, something like:
s/(\t)"(.+?)"(\t)/\1\2\3/

Then double quotes:
s/""/"/

...and so on. Note that PowerGREP doesn't actually use the s/a/b/ substitution syntax given above, but the principle is similar. If you're using a free grep you can build and test each regex, then chain them together in a batch file.
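
For comparison, here is a sketch that departs from the several-pass idea: it matches each quoted field as a whole with one regex and rewrites it in a callback, so the order of the passes no longer matters. It is written in Python rather than for PowerGREP, the names are invented, and it assumes that quotes only ever open a field directly after a tab, a line break or the start of the file.

```python
import re

# A quoted field: opening quote at a field start, then any mix of
# non-quote characters and doubled quotes, then a closing quote that is
# followed by a tab, a line break or the end of the text.
QUOTED_FIELD = re.compile(r'(?<![^\t\n])"((?:[^"]|"")*)"(?=\t|\r?\n|$)')


def _unquote(match: re.Match) -> str:
    content = match.group(1).replace('""', '"')       # doubled quotes -> single
    content = re.sub(r"\r\n|\r|\n", "@LF", content)   # embedded line breaks
    return content.replace("\t", "@TAB")              # embedded tabs


def unquote_fields(text: str) -> str:
    """Rewrite every quoted field in place; unquoted fields are left alone."""
    return QUOTED_FIELD.sub(_unquote, text)
```

Because the record-ending \r\n is never inside a quoted field, it is left untouched, which avoids joining all records into one.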

Ultimately projects like this are a trade-off between the size of the job in hand, complexity of the solution, time required and the prospect of future similar jobs, which would offer the opportunity to reuse the tools, thus getting greater return on time invested.

Dan


 
Philippe Etienne  Identity Verified
Spain
Local time: 22:09
Member
English to French
Searching all CSVs at once Jun 21, 2016

Samuel Murray wrote:
...I have no idea what program people normally use to open these files (apart from Microsoft's internal localisation tools), though I suspect Microsoft Excel.

I used to use SR32 from Funduc Software.

I've also converted MS csv files very manually to TMX, with only source and target (no attributes whatsoever) - Patent pending:
- Remove all irrelevant columns in Excel.
- Paste the result into Word as a two-column source/target table.
- Convert the table to text.
- Search and replace in Word: replace tabs with the Trados pre-2009 middle tags, and hard breaks with the Trados pre-2009 begin/end tags plus another hard break somewhere. This gives a pre-2009 bilingual Trados Word file.
- Fix up the top and end of the file so that the Trados tags are complete and cleaning doesn't return errors.
- Clean the resulting old .doc file with Trados pre-2009.
- Convert the Trados TM to TMX in Workbench.
- Do that to all the MS .csv files you can lay your hands on, and load the resulting TMs into TM-Town to get all the Microsoft jobs.

Escape characters are left as is, and segments are not necessarily sentences and can be very long, like the Help pages at the command prompt with a ton of codes (line breaks, tabs, etc.).

Philippe




[Edited at 2016-06-22 06:16 GMT]


 
Samuel Murray  Identity Verified
Netherlands
Local time: 22:09
Member (2006)
English to Afrikaans
+ ...
TOPIC STARTER
Okay, my own little solution Jun 21, 2016

Dan Lucas wrote:
If you're nervous about regexes, get the trial of RegexBuddy from the same developer and see how you like it. It's a lot cheaper than PowerGREP at only 30 euros, which surely a job of this size would make worthwhile.


Thanks, I'll have a look.

Dan Lucas wrote:
Ultimately projects like this are a trade-off between the size of the job in hand, complexity of the solution, time required and the prospect of future similar jobs, which would offer the opportunity to reuse the tools, thus getting greater return on time invested.


Yup (-:

Philippe Etienne wrote:
I've also converted MS csv files very manually to TMX, with only source and target (no attributes whatsoever) - Patent pending...


I have done similar things in the past.

For now, I have decided to make the assumption that no fields contain line breaks or tabs, and to process the CSV files using an AutoIt script. This assumption catches most lines anyway. The AutoIt script is here:

http://leuce.com/autoit/twotres.zip (also included is a script to help download the TRES files)
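
For anyone who would rather not use AutoIt, the batch part of the job (170 files) could be sketched in Python as below. This is not a port of the script above. The function name, the "*.csv" pattern and the ".txt" output extension are assumptions, and convert stands for whatever text-to-text conversion is chosen.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, convert, pattern="*.csv"):
    """Apply convert(text) -> text to each matching file; return the file count."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(src_dir.glob(pattern)):
        # "utf-16" relies on the byte-order mark at the start of the file;
        # newline="" leaves \r\n and \n exactly as they are.
        with open(src, encoding="utf-16", newline="") as handle:
            text = handle.read()
        out_path = dst_dir / (src.stem + ".txt")
        # Writing with "utf-16" adds a BOM in the machine's native byte order
        # (little-endian on ordinary Windows PCs).
        with open(out_path, "w", encoding="utf-16", newline="") as handle:
            handle.write(convert(text))
        count += 1
    return count
```

The source files are only read, never overwritten, so a bad conversion can simply be rerun into a fresh output folder.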


 

