Html text extraction (CAT Tools Technical Help)

Thread poster: MIGUEL JIMENEZ

MIGUEL JIMENEZ

Local time: 10:46
English to Spanish

Nov 29, 2005

Hi,
I am trying to compile a large corpus of text extracted from web pages for research. Does anybody know the best tool for extracting only the "translatable" text from HTML pages and saving it to a separate text file? This is for research purposes only, so I do not need to convert back to HTML, XML or anything else.
Thanks for your help

Marc P (X)

Local time: 16:46
German to English

Html text extraction

Nov 29, 2005

Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?

Marc

Rodolfo Raya

Local time: 11:46
English to Spanish

Word processors miss text

Nov 29, 2005

MarcPrior wrote:

Why not simply open the html file in a suitable word processor (such as OpenOffice.org) and "Save As" plain text? Are there too many files?

Marc

Word processors usually ignore translatable attributes, such as "alt" in images.

Rodolfo

Sonja Tomaskovic (X)

Germany
Local time: 16:46
English to German

Nov 29, 2005

Rodolfo Raya wrote:
Word processors usually ignore translatable attributes, such as "alt" in images.

I doubt that someone who needs "to compile a big corpus of text extracted from webpages for some research" needs the alt attribute of an image.

Another solution would be to open the file in a text editor and remove all HTML tags. This should be possible with a regexp.

Sonja
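
A minimal sketch of the regexp approach suggested above, written in Python rather than in a text editor. The function name `strip_tags` and the patterns are illustrative assumptions, not something from this thread. Regular expressions are a blunt tool for HTML: this version drops script and style blocks, comments and tags, but it will misread unusual markup (for example a ">" inside an attribute value), and replacing tags with spaces can split a word that has inline formatting in the middle.

```python
import html
import re

# Script and style blocks hold code, not prose, so remove them whole.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def strip_tags(markup):
    """Return the visible text of an HTML string with all tags removed."""
    text = _SCRIPT_STYLE.sub(" ", markup)
    text = _COMMENT.sub(" ", text)
    text = _TAG.sub(" ", text)
    text = html.unescape(text)  # turn &amp;, &eacute; etc. into characters
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    sample = "<p>Hello <b>world</b> &amp; more</p><script>var x = 1;</script>"
    print(strip_tags(sample))
```

For a research corpus where attribute text does not matter, this is often enough; for messy real-world pages a proper HTML parser is usually more reliable.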

Samuel Murray

Netherlands
Local time: 16:46
Member (2006)
English to Afrikaans

HTML2TXT and a DOS command

Nov 29, 2005

MIGUEL JIMENEZ wrote:
I am trying to compile a big corpus of text extracted from webpages for some research. I was wondering if anybody would know what would be the best tool to extract "translatable" text only from html pages and create a separate text file.

In two steps.

1. Download Bobsoft's HTML2TXT and use it to convert all the html files into text files:
http://www.bobsoft.com/h2t/

2. Merge all the text files into a single file by running the following DOS command in the folder that contains them:
copy /b *.txt all.txt

Good luck!
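
The two steps above (convert each page to text, then merge everything into one file) can also be sketched as a single script. This is only an illustration using Python's standard library, not HTML2TXT itself; the names `TextExtractor`, `html_to_text` and `build_corpus` are made up for the example, and it assumes the pages are saved as UTF-8 files ending in .htm or .html.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            # Normalise internal whitespace so each chunk is one clean line.
            self.chunks.append(" ".join(data.split()))


def html_to_text(markup):
    """Step 1: convert one HTML string to plain text, one chunk per line."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "\n".join(parser.chunks)


def build_corpus(source_dir, output_file):
    """Step 2: convert every page under source_dir and merge into one file.

    Returns the number of HTML files processed.
    """
    paths = sorted(
        p for p in Path(source_dir).rglob("*")
        if p.suffix.lower() in {".htm", ".html"}
    )
    with open(output_file, "w", encoding="utf-8") as out:
        for path in paths:
            markup = path.read_text(encoding="utf-8", errors="replace")
            out.write(html_to_text(markup) + "\n")
    return len(paths)
```

Calling `build_corpus("pages", "all.txt")` would write the text of every page under a hypothetical `pages` folder into `all.txt`. Pages in other encodings would need the `encoding` argument adjusted.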

Samuel Murray

Netherlands
Local time: 16:46
Member (2006)
English to Afrikaans

Try Caterpillar (shareware)

Nov 29, 2005

Rodolfo Raya wrote:
Word processors usually ignore translatable attributes, such as "alt" in images.

Well, in that case, try Caterpillar by Stormdance. The web site says "Extracts all text requiring translation - including hidden text and text within tags etc."

The shareware version is limited to 8 HTML pages per project, though. Cost for full version is GBP 25.00. The author claims it is "Wordfast compatible".
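
If the attribute text matters and a commercial tool is not an option, the idea of picking up text inside tags can be sketched with Python's standard HTML parser. This is not how Caterpillar works internally (the thread does not say); the class name `TranslatableExtractor` and the attribute list are assumptions chosen for illustration, and the list is not exhaustive.

```python
from html.parser import HTMLParser

# Attributes that commonly carry human-readable text (assumed, not exhaustive).
TRANSLATABLE_ATTRS = ("alt", "title", "placeholder")


class TranslatableExtractor(HTMLParser):
    """Collect visible text plus the values of translatable attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        # Attribute names arrive lowercased; valueless attributes have None.
        for name, value in attrs:
            if name in TRANSLATABLE_ATTRS and value and value.strip():
                self.segments.append(value.strip())

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.segments.append(" ".join(data.split()))


def extract_translatable(markup):
    """Return text nodes and translatable attribute values in document order."""
    parser = TranslatableExtractor()
    parser.feed(markup)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    page = (
        '<p>See the <a href="/doc" title="User guide">manual</a></p>'
        '<img src="logo.png" alt="Company logo">'
    )
    for segment in extract_translatable(page):
        print(segment)
```

Each attribute value is emitted as its own segment at the position of its tag, so the "alt" text of an image ends up in the corpus alongside the surrounding paragraph text.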
