AnalyzeAssist can segment* the following file types. It skips numbers by default; see Segmentation Rules for how to change this and other segmentation behavior. See Configuring File Extensions for instructions on changing the file associations that AnalyzeAssist uses.
* Segment: To separate a file into segments, which are translation units generally equivalent to sentences.
- Text Files
  - Supported encodings include:
    - UTF-8
    - UTF-16
    - UTF-16 (big-endian)
- Microsoft Word Files
  - AnalyzeAssist also extracts text from text boxes.
- Microsoft Excel Files
  - AnalyzeAssist also extracts text from shapes on each worksheet.
- Microsoft PowerPoint Files
  - AnalyzeAssist also extracts text from MS Word/Excel objects embedded in PowerPoint slides, although results are not guaranteed for text boxes or shapes nested further inside these objects.
- HTML Files
  - AnalyzeAssist extracts the text displayed in the browser, including the document title and the "alt" and "title" attributes of links and images.
- XML Files
  - AnalyzeAssist extracts text data from the XML nodes.
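
To give a feel for what the HTML and XML extraction described above involves, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not AnalyzeAssist's actual implementation: the names `VisibleTextExtractor`, `extract_html_text`, and `extract_xml_text` are hypothetical, and treating `script` and `style` content as "not displayed" is an assumption of this sketch.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class VisibleTextExtractor(HTMLParser):
    """Hypothetical helper: collects displayed text, the document title,
    and the alt/title attribute values of links and images."""

    SKIPPED = {"script", "style"}      # assumption: never shown to the reader
    ATTRIBUTE_TAGS = {"a", "img"}      # links and images, as in the list above

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        if tag in self.ATTRIBUTE_TAGS:
            for name, value in attrs:
                if name in ("alt", "title") and value:
                    self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_html_text(markup):
    """Return the extracted text chunks of an HTML string, in document order."""
    parser = VisibleTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.chunks


def extract_xml_text(markup):
    """Return the non-empty text content of every node in an XML string."""
    root = ET.fromstring(markup)
    return [text.strip() for text in root.itertext() if text.strip()]
```

Each returned chunk would then be passed to the segmenter. The sketch keeps attribute text and element text as separate chunks so that an "alt" or "title" value is never merged into the sentence around it.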