Trouble with charts in OmegaT
Thread poster: Nina Halperin
Nina Halperin  Identity Verified
Peru
Local time: 17:55
Spanish to English
Jul 21, 2020

Hello,

I'm completing a practice project in OmegaT 4.3.2, in which I am translating a Word document that I had converted from a PDF using Wordfast Anywhere. Although almost all the formatting was preserved, some of the charts came out as images rather than editable charts. Consequently, they do not show up at all in OmegaT. In cases like this, do I need to recreate the charts manually in the resulting Word document before loading it into OmegaT?

Additionally, in the charts that did come out correctly, OmegaT has segmented by sentence but also by line of each box, so several sentences came out as two or more segments. For example, if a sentence within a box of the chart runs over three lines of the box, OmegaT divides that sentence into three segments when it should be just one. Is there a way to fix that without adding a "quick-fix rule" for every single sentence under the segmentation settings, which would be extremely tedious and time-consuming? I did notice that in these charts some of the boxes were subdivided into a few different boxes, so I merged the sub-boxes together in the source document within OmegaT. I think that may have helped a little, but for the most part it did not fix the problem. Thanks in advance!


 
esperantisto  Identity Verified
Local time: 01:55
Member (2006)
English to Russian
SITE LOCALIZER
Charts Jul 22, 2020

Charts, as any other embedded objects that have no translatable content, will be reproduced in target documents as they appear in respective source documents. Obviously, you will have to edit / replace them if anything requires translation.

[Edited at 2020-07-22 08:07 GMT]


 
Susan Welsh  Identity Verified
United States
Local time: 18:55
Russian to English
OCR quality? Jul 22, 2020

I suspect that your problems here and with headers/footers may have to do with the quality of your OCR conversion from PDF to Word. For what it's worth to you, here is my checklist for OCR from PDF via ABBYY Finereader (I made this after struggling with endless problems):
1. Mark footnotes and callouts on hard copy.
2. In Finereader, remove footers.
3. Spellcheck in source language.
4. Scroll through and fix OCR errors.
5. Save as "formatted" or "editable" text (or both, if there are tables).
6. Proof PDF against printout of source text for missing copy, format problems.


 
Samuel Murray  Identity Verified
Netherlands
Local time: 23:55
Member (2006)
English to Afrikaans
@Nina Jul 22, 2020

Nina Halperin wrote:
Do I need to recreate the charts manually in the resulting Word document before uploading it to OmegaT?


Yes, OmegaT can only translate editable text. So if your PDF converter fails to convert some parts of the file, you have to edit the converted file in e.g. Microsoft Word and fix (or type in, or recreate) the parts that were not converted properly, before loading it into OmegaT. The same thing is true of other CAT tools.

Additionally, in the charts that did come out correctly, OmegaT has segmented by sentence but also by line of each box, so there are several sentences that came out as two or more segments.


Yes, OmegaT splits text into segments when there is a line break. You have to fix your charts or tables so that there are no line breaks in the middle of sentences or phrases, before loading the file into OmegaT. There may be CAT tools that are smart enough to know that all text inside a table cell should be treated as a single sentence, but I don't know of any.
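To illustrate the effect, here is a rough sketch of how a sentence wrapped over several lines of a cell ends up as several fragments, and how the fragments could be rejoined. It is only a heuristic, not OmegaT's own logic: it assumes that a line not ending in sentence punctuation continues on the next line, so unpunctuated headings would be joined by mistake. The function name and sample text are invented for the example.

```python
# Characters that are assumed to mark the end of a complete unit.
SENTENCE_END = (".", "!", "?", ":", ";")


def join_wrapped_lines(lines):
    """Join lines that look like continuations of the previous line."""
    merged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines left over from the conversion
        if merged and not merged[-1].endswith(SENTENCE_END):
            # Previous line had no closing punctuation: treat as wrapped.
            merged[-1] = merged[-1] + " " + text
        else:
            merged.append(text)
    return merged


cell_lines = [
    "The student will attend",
    "class every day and",
    "complete all homework.",
    "Parent signature:",
]
for segment in join_wrapped_lines(cell_lines):
    print(segment)
```

With hard breaks after every line, the four lines above would be four segments; after joining there are two, which is what the translator actually wants to see.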

esperantisto wrote:
Charts, as any other embedded objects that have no translatable content, will be reproduced in target documents as they appear in respective source documents.


While I agree that embedded objects will not be translated (but OmegaT does handle text boxes correctly), I suspect Nina's converted charts are not embedded objects, since Nina's file was converted from PDF using Wordfast Anywhere.


[Edited at 2020-07-22 18:31 GMT]


 
Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
mQ etc. Jul 22, 2020

Samuel Murray wrote:
Yes, OmegaT splits text into segments when there is a line break. You have to fix your charts or tables so that there are no line breaks in the middle of sentences or phrases, before loading the file into OmegaT. There may be CAT tools that are smart enough to know that all text inside a table cell should be treated as a single sentence, but I don't know of any.


https://helpcenter.memoq.com/hc/en-us/articles/360010377519-Importing-the-content-of-an-Excel-spreadsheet-on-a-cell-by-cell-basis

Probably Transit too.


 
Nina Halperin  Identity Verified
Peru
Local time: 17:55
Spanish to English
TOPIC STARTER
A few questions about eliminating line breaks Jul 22, 2020

Thank you so much to everyone for your replies. Ok, I see that I would have to recreate any parts of the resulting Word document, including charts, that did not come out as editable text.

In the source document within OmegaT, I clicked on the paragraph sign and then saw that there was a paragraph sign at the end of every line in the chart, so I deleted those and then reloaded the source document. That seems to have fixed the problem! Now the segments are as they should be, with two tags in the space where the line break had been.

I also reopened the original document that I had gotten from Wordfast Anywhere and investigated the charts with the paragraph sign enabled, but in that case there were no paragraph signs at the end of every line, but rather little blue circles with four lines sticking out of them. I could not erase them. I think those little circles indicate the end of a cell, because, like I mentioned, in the original Wordfast document every cell got separated into sub-cells. In fact, it appears that every line within each of the original cells got separated into its own sub-cell. Correct me if I'm wrong, but it seems that it was necessary to first merge all the sub-cells into one larger cell like I did originally and then eliminate the paragraph sign at the end of every line. In the case of a text with a lot of charts, this whole process seems like it could be time-consuming. Is there no work-around?
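For documents with many tables, deleting the paragraph marks could in principle be scripted. The following is only a sketch under several assumptions: it works on the table XML as found in `word/document.xml` of a .docx, it assumes the sub-cells have already been merged so that each cell holds several paragraphs, and it moves only plain text runs (hyperlinks, fields and paragraph formatting in the extra paragraphs would be lost). The function names and sample XML are invented; try anything like this on a copy of the file.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"
NS = {"w": W}


def merge_cell_paragraphs(root):
    """Fold every extra paragraph of a table cell into its first paragraph."""
    for cell in list(root.iter(f"{{{W}}}tc")):
        paragraphs = cell.findall("w:p", NS)
        if len(paragraphs) < 2:
            continue
        first = paragraphs[0]
        for extra in paragraphs[1:]:
            # Add a one-space run so the joined lines do not run together.
            spacer = ET.SubElement(first, f"{{{W}}}r")
            space_text = ET.SubElement(spacer, f"{{{W}}}t")
            space_text.text = " "
            space_text.set(XML_SPACE, "preserve")
            for run in extra.findall("w:r", NS):
                first.append(run)
            cell.remove(extra)
    return root


def cell_texts(root):
    """Return the text of each paragraph, grouped by table cell."""
    return [
        ["".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
         for p in cell.findall("w:p", NS)]
        for cell in root.iter(f"{{{W}}}tc")
    ]


SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc>
      <w:p><w:r><w:t>The student will attend</w:t></w:r></w:p>
      <w:p><w:r><w:t>class every day.</w:t></w:r></w:p>
    </w:tc>
  </w:tr>
</w:tbl>
"""

table = ET.fromstring(SAMPLE.strip())
print(cell_texts(table))   # two paragraphs in the cell
merge_cell_paragraphs(table)
print(cell_texts(table))   # one paragraph in the cell
```

The sketch does not write the result back into the .docx package, and it joins every paragraph in a cell whether or not the break was intentional, so cells that really contain several sentences on separate lines would still need a manual check.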

Hans, I took a look at the link you sent. Is there a comparable option for OmegaT?

Susan, are those things one is able to do without ABBYY Finereader? For example, how do you save a Word document as "formatted" or "editable" text? Are you saying that, if I do that in Word, it will preserve the formatting in OmegaT?

Does anyone have suggestions for a free PDF-to-Word converter that might be better quality than Wordfast Anywhere? I used the latter because I had seen it recommended in a ProZ forum. Like I said, overall it maintained the formatting really well, at least for this document. On the other hand, I tried converting an educational document called an IEP, which has a ton of charts and check boxes, with Wordfast Anywhere and the resulting document was of pretty poor quality. Thanks so much again!


 
Samuel Murray  Identity Verified
Netherlands
Local time: 23:55
Member (2006)
English to Afrikaans
@Nina Jul 23, 2020

Nina Halperin wrote:
It seems that it was necessary to first merge all the sub-cells into one larger cell like I did originally and then eliminate the paragraph sign at the end of every line. In the case of a text with a lot of charts, this whole process seems like it could be time-consuming. Is there no work-around?


No, you have to either fix the tables that were created by the OCR program, or recreate the tables from scratch and then copy or type the text into the newly created tables.

Hans, I took a look at the link you sent. Is there a comparable option for OmegaT?


AFAICT, Hans' link relates to Excel only (not Word). No, there is no such option in OmegaT.

How do you save a Word document as "formatted" or "editable" text? Are you saying that, if I do that in Word, it will preserve the formatting in OmegaT?


Susan was referring to tweaking settings in a proper OCR program, such as FineReader. In the OCR program, you can view every page after it was scanned and before it is converted, to tell the program how it should handle each table. You can also choose to convert with *more* formatting retained (which is less good for CAT tools) or with *less* formatting retained (which is better for CAT tools but requires more work to re-format the final file).

Does anyone have suggestions for a free PDF-to-Word converter that might be better quality than Wordfast Anywhere?


All OCR programs struggle with some types of files. There are online PDF-to-DOC converters that offer OCR. Sometimes, your printer comes bundled with a free or demo version of an OCR program.


 
Nina Halperin  Identity Verified
Peru
Local time: 17:55
Spanish to English
TOPIC STARTER
Is it necessary to merge all the sub-cells into one larger cell? Jul 23, 2020

Hi Samuel, thank you so much for your response. I was just wondering if you could confirm this question I posed in my last post: Correct me if I'm wrong, but it seems that it was necessary to first merge all the sub-cells into one larger cell like I did originally and then eliminate the paragraph sign at the end of every line.

 
Samuel Murray  Identity Verified
Netherlands
Local time: 23:55
Member (2006)
English to Afrikaans
@Nina Jul 23, 2020

Nina Halperin wrote:
It seems that it was necessary to first merge all the sub-cells into one larger cell like I did originally and then eliminate the paragraph sign at the end of every line.


You can do whatever you want, but the fact is that the OCR system you're using splits sentences in two, and OmegaT can't unsplit them. So, either translate them while they're split (not ideal) or fix them so that they are no longer split. One way to fix sentences split over multiple cells is to merge the cells.


 
Nina Halperin  Identity Verified
Peru
Local time: 17:55
Spanish to English
TOPIC STARTER
Thank you Jul 23, 2020

Ok perfect, thanks Samuel!

 
Stanislav Okhvat
Local time: 02:55
English to Russian
Preparing Word documents converted from PDF Jul 24, 2020

Hello Nina,

Last year I presented for UTIC Webinars on how to convert PDF to Word with Finereader and prepare the converted document for CAT tools. Specifically, the presentation talks about removing those incorrect paragraph and line breaks in a faster way using TransTools that I developed. Here is the link to the webinar recording and downloadable materials: How to convert PDF to Word format and prepare it for translation properly.

Hope it helps.

Best regards,
Stanislav Okhvat
TransTools – Useful tools for every translator


 

