Discussion Board

Results 1 to 13 of 13
  1. #1
    Regular Contributor
    Join Date
    Jul 2003
    Posts
    63

    Reading a .txt file from a jar without knowing the size of the file

    I have to read a lot of .txt files from the jar. Not all of them at once (they are too large to keep in memory for the whole run of the application), but at different moments. Reading them byte by byte, like this:
    InputStream is;
    ...
    is = getClass().getResourceAsStream(location);
    StringBuffer str = new StringBuffer();
    byte b[] = new byte[1];
    while (is.read(b) != -1) {
        str.append(new String(b));
    }
    ...
    is very slow!

    Reading the whole file at once, like this:

    is = getClass().getResourceAsStream(location);
    byte[] b = new byte[fileSize];
    is.read(b, 0, fileSize);

    works fine, but I have to know the size of each file - and hard-coding the file sizes is not a solution for me, because the .txt files may change often and I don't want to have to change the code every time.

    I've tried using a byte array with a fixed length of 6000 bytes (my biggest .txt file is around 5800), trying to read the whole 6000 bytes, catching the exception and returning whatever was read up to that point. It works, but when I have to read from many files at once, I run out of memory.

    The first approach works very well in all situations, but is much slower.

  2. #2
    Registered User
    Join Date
    Jul 2003
    Location
    Finland, Tampere
    Posts
    1,113
    ionutianasi
    works fine, but I have to know the size of each file - and hard-coding the file sizes is not a solution for me, because the .txt files may change often and I don't want to have to change the code every time.

    You can use the first 2 bytes of each file to store its length. Of course, this means you'll have to keep that value up to date whenever you change the file.
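    A minimal sketch of this length-prefix idea (the class and method names are mine, not from the thread):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class LengthPrefix {
    // When building the file: prepend a 2-byte big-endian length
    static byte[] withLengthPrefix(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(out);
        dos.writeShort(data.length);   // 2 bytes, so max 65535 with readUnsignedShort
        dos.write(data);
        return out.toByteArray();
    }

    // At runtime: read the 2-byte length first, then exactly that many bytes
    static byte[] readPrefixed(InputStream is) throws IOException {
        DataInputStream dis = new DataInputStream(is);
        int len = dis.readUnsignedShort();
        byte[] b = new byte[len];
        dis.readFully(b);              // blocks until all len bytes are read
        return b;
    }
}
```

    The prefix could just as well be written by a small build-time tool, so the MIDlet only ever needs the reading half.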

    Here's another tip. You can try a compromise: read neither the whole file at once nor byte-by-byte. Try reading some 100-500 bytes at a time.
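    That compromise might look like this, keeping the original StringBuffer approach but reading 256 bytes per call (a sketch; the helper name and buffer size are my choices):

```java
import java.io.IOException;
import java.io.InputStream;

public class ChunkRead {
    // Read in mid-sized chunks instead of byte-by-byte or all-at-once
    static String readChunked(InputStream is) throws IOException {
        StringBuffer str = new StringBuffer();
        byte[] b = new byte[256];             // somewhere in the 100-500 byte range
        int n;
        while ((n = is.read(b)) != -1) {
            // only append the n bytes actually read on this pass
            str.append(new String(b, 0, n)); // platform default charset
        }
        return str.toString();
    }
}
```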

    I've tried using a byte array with a fixed length of 6000 bytes (my biggest .txt file is around 5800), trying to read the whole 6000 bytes, catching the exception and returning whatever was read up to that point. It works, but when I have to read from many files at once, I run out of memory.

    Hmm, why do you run out of memory?
    I think that if you knew the size and used is.read(b,0,fileSize); you would also run out of memory.

    I can't see why using a fixed array of 6000 bytes is much different from allocating one dynamically.

  3. #3
    Regular Contributor
    Join Date
    Jul 2003
    Posts
    63

    doctordwarf

    At some point I have to read from 3 different files, and each time I allocate 7k; maybe the garbage collector doesn't keep up.
    Only 3 or 4 files are around 7k; the rest are close to 1k, so allocating dynamically sometimes saves a lot of memory.

    I also thought about keeping the length in the first bytes; if I can't find anything else I guess I'll stick to that solution...

  4. #4
    Super Contributor
    Join Date
    Mar 2003
    Location
    Israel
    Posts
    2,280
    How about using ByteArrayInputStream and ByteArrayOutputStream?
    Then you can read into a byte array without first knowing the size of the file.

    shmoove

  5. #5
    Super Contributor
    Join Date
    Jun 2003
    Location
    Cheshire, UK
    Posts
    7,395
    About the worst thing you can do in Java for performance is call a method, which is why reading one byte at a time is slow. Large numbers of calls to InputStream.read() are half the problem; the same number of calls to StringBuffer.append() or ByteArrayOutputStream.write() are the other half. However, ByteArrayOutputStream gives you the advantage that you can work in larger chunks (and it stores an array of bytes, not chars, so it's half the size). I use ByteArray streams a lot for this kind of thing.
    Code:
    private byte [] readFile (String sFilename) throws IOException {
        InputStream in = getClass ().getResourceAsStream (sFilename);
        ByteArrayOutputStream out = new ByteArrayOutputStream (1024);
    
        byte [] aoBuffer = new byte [512];
    
        int nBytesRead;
    
        while ( (nBytesRead = in.read (aoBuffer)) > 0 ) {
            out.write (aoBuffer, 0, nBytesRead);
        }
    
        in.close ();
    
        return out.toByteArray ();
    }
    Remember that toByteArray() makes a copy of the array in the ByteArrayOutputStream object, which increases the memory requirement.

    All dynamic array-type objects (ByteArray streams, StringBuffers, Vectors) in Java work by creating a small array, then when it fills they create another, larger one and copy the old data across; the old array is then garbaged. They can create a lot of garbage, and the final array size is bigger than needed. In J2SE, ByteArrayOutputStreams grow exponentially - I don't know if the J2ME implementation works the same way.
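    Roughly, the grow-and-copy behaviour described above looks like this (a sketch, not the actual library code; the doubling factor is an assumption):

```java
public class GrowDemo {
    // Return an array at least `needed` long, doubling the old capacity
    // until it fits - the old array then becomes garbage.
    static byte[] grow(byte[] old, int needed) {
        int newLen = old.length > 0 ? old.length : 1;
        while (newLen < needed) {
            newLen *= 2;                       // exponential growth
        }
        byte[] bigger = new byte[newLen];
        System.arraycopy(old, 0, bigger, 0, old.length);
        return bigger;
    }
}
```

    This is why the final array is usually bigger than the data it holds, and why each growth step leaves a dead array behind for the collector.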

    You can't be all that short on memory if you can get away with using StringBuffers... those are likely to be very expensive on memory.

    Graham.

  6. #6
    Registered User
    Join Date
    Nov 2003
    Location
    UK
    Posts
    27
    I like your readFile function, but I can't find a way to load up strings. I saved them in Notepad in UTF-8 format, but readUTF() doesn't work. It doesn't even throw an exception - I get an "Unable to Run Application" error on the emulator!

    bytearray = readFile(sFile);
    bais = new ByteArrayInputStream(bytearray);
    dis = new DataInputStream(bais);

    String s = dis.readUTF(); // bad


    I can read bytes from the data input stream no problem and my strings are there...

    There is a newline (10) and a carriage return (13) between my strings.

    The first three bytes of the file are:
    239, 187 and 191, which I presume is a file header added to the start by Notepad?



    Any ideas why this isn't working??


    Thanks,
    Buffalo

  7. #7
    Super Contributor
    Join Date
    Jun 2003
    Location
    Cheshire, UK
    Posts
    7,395
    readUTF() will only read data written by writeUTF(), because a string length gets written in front of each string (so it needs to be there to read the string back).

    Also, beware that Notepad adds the character \ufeff (a byte-order mark) to the start of files saved in Unicode-based formats. Encoded in UTF-8, that is the three bytes 239, 187, 191 you're seeing at the start of the file.
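    As a workaround, you can convert the raw bytes to a String yourself and skip the BOM if it's there; a sketch (the helper name is mine), assuming the bytes came from a readFile()-style method:

```java
import java.io.UnsupportedEncodingException;

public class BomStrip {
    // Convert raw file bytes to a String, skipping a UTF-8 BOM (EF BB BF) if present
    static String bytesToString(byte[] b) throws UnsupportedEncodingException {
        int offset = 0;
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            offset = 3;                       // skip the byte-order mark
        }
        return new String(b, offset, b.length - offset, "UTF-8");
    }
}
```

    After that you can split the resulting String on \n yourself, rather than relying on readUTF().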

    There is a discussion of text resources at http://discussion.forum.nokia.com/fo...threadid=29350.

    Graham.

  8. #8
    Registered User
    Join Date
    Mar 2004
    Posts
    37
    grahamhughes

    is it possible to always read the whole line with your readFile() method, even if I don't know the number of bytes in each line of my .txt file?

    Thanks,
    Heike

  9. #9
    Regular Contributor
    Join Date
    Aug 2003
    Location
    uk
    Posts
    232

    whole line

    If you want to do things on a line by line basis and your text file isn't huge, I would recommend loading the whole thing into memory and writing your own parse methods.

    For example, if you have the file in memory and converted into a char array:

    Code:
    public static String getNextLine(char buf[], int ends[])
    {
    	int i = ends[0];
    	int k;
    
    	// lines are terminated by \n; skip any leading newlines
    	// (bounds check first, to avoid reading past the end of the array)
    	while(i < buf.length && buf[i] == '\n') 
    		i++;
    	k=i;
    	// advance to the end of the line (stop at \n or the \0 padding)
    	while(k < buf.length && buf[k] != '\n' && buf[k] > 0) 
    		k++;
    
    	if (i == k)
    		return null;
    
    	ends[0] = k;
    	ends[1] = k;
    
    	return new String(buf, i, k-i);
    }
    will give you line-by-line processing, as in this example.

    Code:
    String s = new String(loadBinaryFile("/foo.txt"));
    char buf[] = s.toCharArray(); 
    s=null;
    int ends[] = {0,0};
    
    // remove white space, \r, empty lines, comments etc		
    preProcess_TextFile(buf);
    
    
    while( (s=getNextLine(buf, ends)) != null )
    {
    	// s == current line
    }
    The preprocessing function follows (it must be called, if only to make sure the text uses only \n and not \r\n):

    Code:
    public static void preProcess_TextFile(char buf[])
    {
    	//preProcess_removeCarriageReturns(buf);
    	int i = 0, k = 0; 
    	boolean inquote;
    	while(i < buf.length)
    	{
    		buf[k] = buf[i];
    		if (buf[i] != '\r')
    			k++;
    		i++;
    	}
    	while(k < buf.length)	buf[k++]='\0';
    	//buf[k]='\0';
    
    
    	//preProcess_ReprocessSpecialChars(buf);
    	i = 0; // eg [\][n] becomes [\n]
    	k = 0;
    	while(i < buf.length-1)
    	{
    		buf[k] = buf[i];
    		if (buf[i] == '\\' && buf[i+1] == 'n')
    		{
    			buf[k] = '\n';
    			i++;
    		}
    		i++;k++;
    	}
    	if (i < buf.length) // the loop stops one short; copy the final character
    		buf[k++] = buf[i];
    	while(k < buf.length)	
    		buf[k++]='\0';
    
    
    	//preProcess_tolower(buf);
    	i=0;
    	inquote = false;
    	while(i < buf.length)
    	{
    		if (buf[i] == '"')
    			inquote = !inquote;
    		else if (!inquote)
    			if (buf[i] >= 'A' && buf[i] <= 'Z')
    				buf[i] = (char)( (int)buf[i] + 'a' - 'A');
    		i++;
    	}
    
    
    	//preProcess_removeComments(buf);		
    	i = 0;
    	while(i < buf.length)
    	{
    		while(i < buf.length && buf[i] != '#')
    			i++;
    		while(i < buf.length && buf[i] != '\n')
    			buf[i++] = ' ';			
    	}
    
    
    	//preProcess_removeWhiteSpace(buf);
    	i = 0;
    	k = 0;
    	inquote = false;
    	while(i < buf.length)
    	{
    		buf[k] = buf[i];
    		
    		if (buf[k] == '"')
    			inquote = !inquote;
    		
    		if (inquote || (buf[i] != ' ' && buf[i] != '\t') )
    			k++;
    		i++;
    	}
    	while(k < buf.length)	
    		buf[k++]='\0';
    
    
    	//preProcess_cutEmptyLines(buf);
    	i = 0;
    	k = 0;
    	inquote = false;
    	while(i < buf.length)
    	{
    		buf[k] = buf[i];
    
    		if (buf[k] == '"')
    			inquote = !inquote;
    		
    		if (inquote || buf[i] != '\n' || i+1 >= buf.length || buf[i+1] != '\n') 
    			k++; // advance k (dest) if [i] and [i+1] are not both \n
    		
    		i++;
    	}
    	while(k < buf.length)	
    		buf[k++]='\0';
    }
    Alex
    Last edited by alex_crowther; 2004-05-23 at 14:18.

  10. #10
    Registered User
    Join Date
    Mar 2004
    Posts
    37
    Hi Alex,

    Thanks a lot for your help and your code.

    I have still some more questions to this subject:

    What would you define as a "huge" file?

    Does the processing work faster if I read the whole file instead of one line after the other?

    For organizing my data I think it's helpful to use the line numbers; that's why I thought about reading line by line. But perhaps I should think about another way to combine data from the .txt with data from RMS, like numbers used as a kind of index?

    In case I only need certain parts of my .txt, is there a possibility to jump directly to certain line numbers (or other identifiers, like a number used as a kind of index), or will I have to read the whole file and filter the needed data, e.g. with a for loop?

    I hope someone can help me. Thanks in advance.
    Heike

  11. #11
    Regular Contributor
    Join Date
    Aug 2003
    Location
    uk
    Posts
    232
    Well, "huge" would be defined relative to the amount of spare heap you have available.

    If the text file will use most or all of the available heap, then you will most likely run out of memory.

    Speed-wise, it's assumed the whole file is in memory.
    The reason is that when you open a stream, the whole file is copied into memory anyway; hence you may as well just read the whole thing in and close the stream. If you are going to have the whole file in memory, you might as well have full access to it instead of being limited by the stream methods.

    If you want to check this, check the memory in use before and after opening a stream. After opening, free memory should go down by the size of the file being opened, meaning the whole file was copied into memory when you opened it, and stream operations are in fact really memory operations.
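    That check might be sketched like this (the helper is mine; freeMemory() is only an estimate, and non-J2ME VMs may not buffer the whole resource, so treat the result as a rough indication):

```java
import java.io.InputStream;

public class MemCheck {
    // Returns (free memory before) - (free memory after) opening the stream,
    // i.e. roughly how much heap the open itself consumed.
    static long freeDelta(String resource) throws Exception {
        Runtime rt = Runtime.getRuntime();
        rt.gc();                                       // settle the heap first
        long before = rt.freeMemory();
        InputStream in = MemCheck.class.getResourceAsStream(resource);
        long after = rt.freeMemory();
        if (in != null) in.close();                    // missing resources return null
        return before - after;
    }
}
```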

    Regarding line numbers... I'm not sure what it is you are doing, so I cannot comment. The code I wrote was designed for processing a text file of definitions in a line-by-line manner. Only lines containing non-white-space were of interest, which is why the white-space lines are removed, etc.

    Alex

  12. #12
    Registered User
    Join Date
    Mar 2004
    Posts
    37
    Interesting to know that the app always reads the whole file into memory - so I can do the same in my code.

    Concerning line numbers: I'm writing a vocabulary coach, so I'll use the line numbers as identifiers to combine words with results from RMS. Does that explain it?

    Heike

  13. #13
    Registered User
    Join Date
    May 2004
    Posts
    10

    Reading from a binary file is faster

    Hey ,

    Try converting the .txt file to a binary file and reading chunks of data from the .bin file via a ByteArrayInputStream.

    You can either wrap this InputStream in a DataInputStream and read floats, or you can apply techniques to read ints, floats and strings directly from the byte array.
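    A sketch of that approach, assuming a hypothetical .bin layout of an int count followed by that many floats (the layout and names are my assumptions, not from the thread):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class BinReader {
    // Parse an int count followed by that many big-endian floats
    static float[] readFloats(byte[] raw) throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(raw));
        int n = dis.readInt();
        float[] vals = new float[n];
        for (int i = 0; i < n; i++) {
            vals[i] = dis.readFloat();
        }
        return vals;
    }
}
```

    The same wrapping works on the stream from getResourceAsStream() directly, so the count at the front also solves the "don't know the file size" problem from the start of the thread.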

    I believe it's the fastest method for reading a resource file. Plus, when converting to a .bin file, the size of the .txt automatically gets reduced.

    I am still a novice J2ME programmer, so comments are welcome.

    Cheers,
    kiks
