  • Intel Consumer SSDs: Not appropriate for databases (evanjones.ca)
    it seemed to recover from this error. With the cache disabled, no data loss was observed with 4 kB writes. Maybe this means that 4 kB writes are actually durable, or maybe I just didn't manage to observe the problematic behavior.

    Testing Methodology

    A program runs that writes a sequence number to disk, then reports the number. While it is running, you crash the system, reboot, and check what data exists on disk. If sequence number x was reported as written, then the last value written should be x, x+1, or some partial version of x+1. If the last complete record is less than x, then data has been lost. I did the following test five times for each configuration:

    1. Start logfilecrashserver on a workstation: logfilecrashserver 12345
    2. Start minlogcrash on the system under test: minlogcrash /tmp workstation 12345 131072
    3. Once the workstation starts receiving log records, pull the power from the back of the disk.
    4. Power off the system. (My system doesn't support hotplug, so pulling the power on the disk makes it unhappy; if your system supports hotplug, this may not be needed.)
    5. Reconnect power to the disk.
    6. Turn on the test system.
    7. Observe the output file using hexdump. The output of hexdump should show that the file has at least the last record reported by logfilecrashserver.

    Durable Writes

    In order for a write to actually be durable, all the layers in the stack need to cooperate, so that "save to disk" actually means "I mean it: make sure all the stuff is on the disk so that I can read it back if something bad happens." On Linux with the ext3 and ext4 file systems, this actually works when write barriers are enabled (the default on ext4; they must be manually enabled on ext3). This feature causes the operating system to issue a CACHE FLUSH command to the disk when an application calls fsync or fdatasync, which is exactly what databases and my test program do. If the disk works correctly, it waits until all the data is actually written to the disk, then acknowledges that the flush operation has completed.

    Similar Reports
    I am not the only person to observe these problems with Intel SSDs. Others have found that the X25-E and the X25-M G2 both lose data with the write cache enabled. However, it appears that I am the first to report this type of problem with the write cache disabled, perhaps because I am the first to test writes that are larger than 4 kB. I've reported this to the ext4 developers in an attempt to make sure I'm not making an error. Similar issues have been reported for magnetic disks in the past, although with write barriers enabled (the default for ext4; use the barrier=1 mount option for ext3) this should not be a problem. However, it is still worth testing your configuration. My test program was inspired by Brad's diskchecker. Additional Boring Test
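The write-then-report step the methodology depends on can be sketched as follows. This is not the article's actual minlogcrash source; the record format and function name are made up for illustration:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical sketch of the durable-write step: append a sequence
 * number record, then fsync() so that, with write barriers enabled,
 * the kernel issues a CACHE FLUSH to the disk. Only after fsync()
 * succeeds may the sequence number be reported as written. */
int write_durable(int fd, uint64_t sequence) {
    char record[32];
    int len = snprintf(record, sizeof(record), "%llu\n",
                       (unsigned long long)sequence);
    if (write(fd, record, len) != len)
        return -1;
    return fsync(fd);  /* blocks until the disk acknowledges the flush */
}
```

If the disk acknowledges the flush before the data is on stable media, a power cut between the acknowledgement and the physical write loses the record, which is exactly the failure the test looks for.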

    Original URL path: http://www.evanjones.ca/intel-ssd-durability.html (2016-04-30)


  • Write Latency and Block Size (evanjones.ca)
    the kernel to read the page in from disk before performing the write. This is terribly slow because it requires a disk seek (8 ms on my system). However, if you write entire pages, then the kernel is fast, as it recognizes that it doesn't need to read the old page in. It still takes longer than modifying a cached page (25 µs vs. 10 µs), likely because it needs to find and zero a free page before performing the copy. Finally, if you want to perform I/O that avoids the page cache, Linux provides the O_DIRECT option to open or fcntl. In this case it doesn't matter if the page is in the cache or not, but you must perform I/O in blocks that are a multiple of 512 bytes long, corresponding to the disk sector size. This is because this interface literally passes the write to the disk immediately, rather than modifying the page cache. I'm guessing that on newer disks that have 4096-byte sectors, the I/O will need to be 4 kB aligned. Mac OS X provides a similar feature via fcntl's F_NOCACHE option, although its implementation is different: it permits writes of any size, but performs terribly if you perform writes that are not a multiple of 4 kB. Similar issues will appear with RAID. In the case of RAID 5 or 6, I'll call the amount of data written to a disk the "chunk size," and I'll use "stripe size" to refer to the number of data disks times the chunk size. (Unfortunately, different RAID systems use different terms for these concepts.) A write of an entire stripe can simply be passed through to the disks. A write that touches less than the entire stripe
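The O_DIRECT alignment constraints described above can be sketched as a minimal wrapper. This is a guess at idiomatic usage, not code from the article; 512 bytes is the documented minimum alignment, and 4096 is used here as the conservative choice for newer 4 kB-sector disks:

```c
#define _GNU_SOURCE  /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical sketch of an O_DIRECT write: both the buffer address
 * and the I/O length must be sector-aligned, because the request
 * bypasses the page cache and goes straight to the disk. */
ssize_t direct_write(const char *path, const void *data, size_t len) {
    const size_t alignment = 4096;
    if (len == 0 || len % alignment != 0)
        return -1;  /* O_DIRECT rejects unaligned lengths */

    void *buf;
    if (posix_memalign(&buf, alignment, len) != 0)  /* aligned address */
        return -1;
    memcpy(buf, data, len);

    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        free(buf);
        return -1;
    }
    ssize_t n = write(fd, buf, len);
    close(fd);
    free(buf);
    return n;
}
```

posix_memalign is what satisfies the buffer-address requirement; an ordinary malloc'd buffer would make the write fail with EINVAL.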

    Original URL path: http://www.evanjones.ca/write-latency-alignment.html (2016-04-30)

  • More Than You Wanted to Know About Checksums (evanjones.ca)
    slower to compute. Cyclic redundancy checks (CRCs) are functions that are a good middle ground: they have been well studied by mathematicians and can be easily implemented in hardware. In particular, the CRC known as CRC32C (Castagnoli) has been found to be particularly strong for communication applications. This particular CRC is now implemented in hardware in Intel CPUs as part of the SSE4.2 extensions. This hardware support changes things significantly: on CPUs that support the CRC32 instruction, computing this relatively strong checksum is even cheaper than computing Adler-32 checksums. Thus, there is no need to use these weaker functions. As these CPUs are replaced and upgraded, this hardware instruction will be available everywhere. The only risk is that currently AMD, Via, and Intel Atom CPUs do not support this instruction. Thus, a software fallback is needed, and on these CPUs the computation will be slower than functions based on Adler-32 or Fletcher's checksum.

    Software Implementation

    Implementing CRC32C efficiently has been well studied. The best algorithm is called Slicing-by-8, which uses an 8 kB lookup table. The authors' experimental results from their paper show that even with cold caches, this outperforms the typical algorithm (with a 1 kB lookup table) as long as the message is longer than approximately 200 bytes. I've taken Intel's source code, ported it to GCC, and included it in my implementation (see link below). Using the CRC32 instruction is straightforward using GCC's built-in functions. I experimented with this a bit, and it turns out there are two tweaks for getting slightly better performance. First, aligned memory accesses are typically faster. However, my results show that for the CRC32 instruction it only matters for data blocks larger than 4096 bytes, and even then the difference
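As a reference point, here is a bit-at-a-time software CRC32C. This is the slow fallback form, not Intel's Slicing-by-8 code or the hardware instruction, but all three must produce identical values:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC32C using the reflected Castagnoli polynomial
 * 0x82F63B78. Far slower than Slicing-by-8 or the SSE4.2 CRC32
 * instruction, but small enough to serve as a reference
 * implementation or a fallback on CPUs without the instruction. */
uint32_t crc32c(const void *data, size_t len) {
    const uint8_t *p = (const uint8_t *)data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            /* If the low bit is set, shift and XOR the polynomial;
             * otherwise just shift. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for CRC-32C is crc32c("123456789", 9) == 0xE3069283, which is a quick way to verify any faster implementation against this one.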

    Original URL path: http://www.evanjones.ca/crc32c.html (2016-04-30)



  • Linux Write Caching (evanjones.ca)
    files in /proc/sys/vm that are adjustable. While the percentage of dirty pages is less than dirty_background_ratio (default: 10 on my system), dirty pages stay in memory until they are older than dirty_expire_centisecs (default: 30 seconds). The pdflush kernel process wakes up every dirty_writeback_centisecs to flush these expired pages out. If a writing process dirties enough pages that the percentage rises above dirty_background_ratio, it proactively wakes pdflush to start writing data out in the background. If the percentage of dirty pages rises above dirty_ratio (default: 20 on my system), then the writing process itself will synchronously write pages out to disk. This puts the process in uninterruptible sleep (indicated by a D in top). The CPU will be shown in the iowait state. This is actually idle time: if there were processes that needed the CPU, they would be scheduled to run. The percentages are of the total reclaimable memory (free + active + inactive from /proc/meminfo). On a 32-bit system, the high memory region is excluded if vm.highmem_is_dirtyable is 0 (the default).

    References

    The Linux Page Cache and pdflush: slightly dated, but still mostly correct; provides a detailed explanation. Understanding the Linux Kernel, Third Edition: useful and detailed explanations of the kernel internals. Section 15.3, "Writing Dirty Pages to Disk," is where you should start, but Section 16.1, "Reading and Writing a File," is also informative. mm/page-writeback.c: the actual kernel source that implements this logic; see balance_dirty_pages.

    Other Random Notes

    On my system with the ext3 file system, the journal commit interval (default: 5 seconds) causes dirty pages to be written out by kjournald. Passing the commit=seconds option to mount adjusts this time. This does not appear to
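The tunables above can be inspected directly from /proc. A small sketch, assuming the standard /proc/sys/vm file names (defaults vary by system and kernel version):

```c
#include <stdio.h>

/* Read one writeback tunable from /proc/sys/vm. Returns its value,
 * or -1 if it cannot be read (for example, on a non-Linux system or
 * a kernel that lacks that tunable). */
int read_vm_tunable(const char *name) {
    char path[128];
    snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int value = -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

For example, read_vm_tunable("dirty_background_ratio") and read_vm_tunable("dirty_ratio") return the two percentage thresholds described above, and read_vm_tunable("dirty_expire_centisecs") the age limit.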

    Original URL path: http://www.evanjones.ca/linux-write-caching.html (2016-04-30)

  • Java String Encoding Internals (evanjones.ca)
    by accessing the raw char array in the String object. The UTF-8 bytes are copied from the temporary byte array into the final byte array with the exact right length.

    Conclusion

    This allocates s.length() * 4 bytes of garbage and has an extra copy. This is what permits custom code to be slightly faster than the JDK: custom code produces less garbage, particularly for ASCII or mostly-ASCII text. Significant wins are possible when the destination does not need to be a byte array with the exact length. For example, writing directly to the output buffer, or to a ByteBuffer with unused bytes at the end, can be faster. See my StringEncoder class in the source code used for these benchmarks if you want to try and take advantage of this in your own code.

    The details, with links to the source code: String.getBytes calls StringCoding.encode(charsetName, value, offset, count). This method gets a cached StringCoding.StringEncoder object, stored in a static ThreadLocal<SoftReference<StringEncoder>>. This is a good trick for thread-specific encoders, since it permits the JVM to garbage collect them if it is under memory pressure. This object wraps a Charset and a CharsetEncoder. It also sets an isTrusted boolean to true if the Charset is provided by the JDK (charset.getClass().getClassLoader0() == null). StringCoding.encode then checks that the charset string matches the one passed in; if not, it creates a new StringEncoder after looking up the Charset by name. Finally, it calls StringEncoder.encode(chars, offset, length). StringEncoder.encode allocates a byte array of size length * encoder.maxBytesPerChar(). Note that for UTF-8, the JDK reports that maxBytesPerChar = 4. It checks if the character set isTrusted; if it is not, it makes a defensive copy of the input string. This is to prevent a user

    Original URL path: http://www.evanjones.ca/software/java-string-encoding-internals.html (2016-04-30)

  • Debugging with a Circular Buffer (evanjones.ca)
    could possibly end up in that state. I first tried adding printf statements to my program to trace the execution, but it generated tons of output (although in retrospect, tail probably could have solved that part of the problem) and ended up slowing the program down enough that the bug did not happen. To solve this, I ended up logging data to a per-thread circular buffer. This was fast enough to still hit the bug, and then I could look at the last entries in the circular buffer to figure out how the program got stuck. My program was written in C, and conveniently each thread already had a state structure to add a buffer to. The buffer ended up looking like this:

        char event_log[16][100];
        int next_event;

    I then used a macro to add an entry to the log:

        #define LOG_EVENT(con, message, ...) \
            snprintf(con->event_log[con->next_event], \
                    sizeof(*con->event_log), message, ##__VA_ARGS__); \
            con->next_event += 1; \
            if (con->next_event == EVENT_LOG_SIZE) con->next_event = 0;

    Anywhere I wanted to log something, I can call this macro like a printf statement:

        LOG_EVENT(con, "drizzle_state_row_read pkt %d %d"
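A self-contained version of the idea can be sketched as follows. The struct name and the EVENT_LOG_SIZE value are stand-ins for the program's existing per-thread state:

```c
#include <stdio.h>

/* Sketch of the per-thread circular log: a fixed array of fixed-size
 * entries plus a next-entry index that wraps around. The connection
 * struct here is a stand-in for the real per-thread state structure. */
#define EVENT_LOG_SIZE 16

struct connection {
    char event_log[EVENT_LOG_SIZE][100];
    int next_event;
};

/* Format a message into the next slot, then advance and wrap the
 * index. The do/while(0) wrapper makes the macro usable as a single
 * statement, e.g. inside an if with no braces. */
#define LOG_EVENT(con, ...) do { \
        snprintf((con)->event_log[(con)->next_event], \
                 sizeof((con)->event_log[0]), __VA_ARGS__); \
        (con)->next_event += 1; \
        if ((con)->next_event == EVENT_LOG_SIZE) \
            (con)->next_event = 0; \
    } while (0)
```

After a crash or hang, dumping event_log starting at next_event (the oldest surviving entry) replays the last EVENT_LOG_SIZE events in order.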

    Original URL path: http://www.evanjones.ca/debugging-circular-buffer.html (2016-04-30)

  • Ken Thompson's "Trusting Trust" Quine (evanjones.ca)
    code. Bootstrap the quine by saving this as quine.c:

        #include <stdio.h>

        int main() {
            int i;
            printf("char s[] = {\n");
            for (i = 0; s[i]; i++)
                printf("\t%d,\n", s[i]);
            printf("%s", s);
            return 0;
        }

    Save the following Python script as quine.py:

        header = '\t0\n};\n\n'
        d = header + open('quine.c').read()
        for c in d:
            print '\t%s,' % repr(c)
        print '\t0'

    Run: python quine.py >> quine.c. Now you have the contents of the s array appended to quine.c. Edit quine.c to make the array:

        char s[] = {
        (generated output goes here)

        #include <stdio.h>
        (continues)

    Compile this to produce quine1: gcc -o quine1 quine.c. Run quine1 to produce quine2.c: ./quine1 > quine2.c. Compile quine2.c to produce quine2: gcc -o quine2 quine2.c. Run quine2 to produce quine3.c, which is exactly the same as quine2.c: ./quine2 > quine3.c. Compare the two: cmp quine2.c quine3.c, or diff -u quine2.c quine3.c. They should be identical.

    You can skip the quine-generating program (quine.c) by changing quine.py to generate the numbers directly, but this is different from Thompson's version in his paper:

        header = '\t0\n};\n\n'
        d = header + open('quine.c').read()
        for c in d:
            print '\t%d,' % ord(c)
        print '\t0'

    The quine works because the s array contains a representation of the program, starting at the 0 terminator (which is the header inserted by the Python script) through to the end of the program. The program itself prints the definition of the s array (char s[] = {), then loops through each character in s up to, but not including, the 0, printing its representation (\t%d,\n). This correctly reproduces the first part of the program. Finally, the program prints

    Original URL path: http://www.evanjones.ca/quines-trusting-trust.html (2016-04-30)

  • Buffer Overflows 101 (evanjones.ca)
    get "stack smashing detected" and buffer terminated. Compile it without stack-smashing protection: gcc -fno-stack-protector -U_FORTIFY_SOURCE -o buffer buffer.c (see Ubuntu's compile flag documentation for more information about these options). Crash it by typing more than 1 line of characters; I now get "Segmentation fault," as expected. However, notice that each time you run it, the buffer address changes. Optional: look at the assembly code with gcc -S -fno-stack-protector -U_FORTIFY_SOURCE -Os buffer.c; the source is in buffer.s. Disable address space randomization: sudo /bin/sh -c 'echo 0 > /proc/sys/kernel/randomize_va_space' (you may want to re-enable this when you are done by changing this command to echo 1). Run buffer, and the buffer's address should now be constant; on my 32-bit machine it is always 0xbffff4c4. Compile hack.c: gcc -o hack hack.c. Crash buffer with the hack: ./hack [buffer address] [diff] | ./buffer. On a modern machine, particularly 64-bit machines, you may get "Segmentation fault." This may mean you need to disable the no-execute stack by adding -Wa,--execstack to the compile command line. You can verify that it worked with execstack -q buffer: it should show an X in front of the binary name. If you are running on a 64-bit machine, the diff parameter is useless due to the way that parameters are passed. This means you might need to hack the hard-coded offset in hack.c. Sorry! Ideally, you now have a shell: try typing ls. Typing CTRL-D will terminate it. This isn't too exciting, but if buffer had been connected to the network, this would be a shell on another computer. You can actually sort of make this work: on one computer, run nc -l
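The vulnerable pattern in buffer.c presumably looks something like this. The article's actual file is not reproduced in this excerpt, so the buffer size and names here are illustrative:

```c
#include <stdio.h>
#include <string.h>

/* Illustrative sketch of the vulnerable pattern: a fixed-size stack
 * buffer filled with no length check. Input longer than the buffer
 * runs past it and, with the stack protector and ASLR disabled as in
 * the steps above, overwrites the saved return address. The printed
 * address is what an exploit like hack.c needs. */
size_t vulnerable(const char *input) {
    char buffer[128];
    printf("buffer is at %p\n", (void *)buffer);
    strcpy(buffer, input);  /* the overflow: no bounds check */
    return strlen(buffer);
}
```

With safe input this just echoes the string; the bug only manifests when the input exceeds the 128-byte buffer, which is why the walkthrough asks you to type more than one line of characters.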

    Original URL path: http://www.evanjones.ca/buffer-overflow-101.html (2016-04-30)